CoNLL-U Parser
CoNLL-U Parser parses a CoNLL-U formatted string into a nested python dictionary. CoNLL-U is often the output of natural language processing tasks.
Why should you use conllu?
- It's simple: ~300 lines of code.
- It has no dependencies.
- It has full typing support, so your editor can do autocompletion.
- It has a nice set of tests with a CI setup.
- It has 100% test branch coverage (and has undergone mutation testing).
Installation
Note: As of conllu 5.0, Python 3.8 is required to install conllu. See Notes on updating from 4.0 to 5.0
pip install conllu
Or, if you are using conda:
conda install -c conda-forge conllu
Notes on updating from 5.0 to 6.0
Conllu version 6.0 drops support for one method from the public API: parse_conllu_plus_fields. This is no longer needed as we have refactored how fields are read. You likely didn't use this function, but this was part of the public API, so I'm releasing a new major version.
Notes on updating from 4.0 to 5.0
Conllu version 5.0 drops support for Python 3.6 and 3.7 and requires Python 3.8 at a minimum. If you need support for older versions of python, you can always pin your install to an old version of conllu. You can install it with pip install conllu==4.5.3.
Notes on updating from 3.0 to 4.0
Conllu version 4.0 drops support for Python 2 and all versions earlier than Python 3.6. If you need support for older versions of python, you can always pin your install to an old version of conllu. You can install it with pip install conllu==3.1.1.
Notes on updating from 2.0 to 3.0
The Universal Dependencies 2.0 release renamed two of the fields: xpostag -> xpos and upostag -> upos. Version 3.0 of conllu handles this by aliasing the previous names to the new names. This means you can use xpos/upos or xpostag/upostag; they will both return the same thing. This does change the public API slightly, so I've upped the major version to 3.0, but I've taken care to ensure you most likely DO NOT have to update your code when you update to 3.0.
Notes on updating from 0.1 to 1.0
I don't like breaking backwards compatibility, but to be able to add new features I felt I had to. This means that updating from 0.1 to 1.0 might require code changes. Here's a guide on how to upgrade to 1.0.
Example usage
At the top level, conllu provides two methods: parse, which parses sentences into a flat list, and parse_tree, which returns a nested tree structure. Let's go through them one by one.
Use parse() to parse into a list of sentences
>>> from conllu import parse
>>>
>>> data = """
... # text = The quick brown fox jumps over the lazy dog.
... 1 The the DET DT Definite=Def|PronType=Art 4 det _ _
... 2 quick quick ADJ JJ Degree=Pos 4 amod _ _
... 3 brown brown ADJ JJ Degree=Pos 4 amod _ _
... 4 fox fox NOUN NN Number=Sing 5 nsubj _ _
... 5 jumps jump VERB VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 0 root _ _
... 6 over over ADP IN _ 9 case _ _
... 7 the the DET DT Definite=Def|PronType=Art 9 det _ _
... 8 lazy lazy ADJ JJ Degree=Pos 9 amod _ _
... 9 dog dog NOUN NN Number=Sing 5 nmod _ SpaceAfter=No
... 10 . . PUNCT . _ 5 punct _ _
...
... """
Now you have the data in a variable called data. Let's parse it:
>>> sentences = parse(data)
>>> sentences
[TokenList<The, quick, brown, fox, jumps, over, the, lazy, dog, ., metadata={text: "The quick brown fox jumps over the lazy dog."}>]
Advanced usage: If you have many sentences to parse at once (say, over a megabyte), you can avoid loading them all into memory by using parse_incr() instead of parse. It takes an opened file and returns a generator instead of a list, so you need to either iterate over it or call list() on it to get the TokenLists out. Here's how you would use it:
from io import open
from conllu import parse_incr
data_file = open("huge_file.conllu", "r", encoding="utf-8")
for tokenlist in parse_incr(data_file):
    print(tokenlist)
For most files, parse works fine.
Since one CoNLL-U file usually contains multiple sentences, parse() always returns a list of sentences. Each sentence is represented by a TokenList.
>>> sentence = sentences[0]
>>> sentence
TokenList<The, quick, brown, fox, jumps, over, the, lazy, dog, ., metadata={text: "The quick brown fox jumps over the lazy dog."}>
The TokenList supports indexing, so you can get the first token, represented by a Token (a dict subclass), like this:
>>> token = sentence[0]
>>> token
{'id': 1,
'form': 'The',
'lemma': 'the',
...}
>>> token["form"]
'The'
New in conllu 2.0: filter() a TokenList
>>> sentence = sentences[0]
>>> sentence
TokenList<The, quick, brown, fox, jumps, over, the, lazy, dog, ., metadata={text: "The quick brown fox jumps over the lazy dog."}>
>>> sentence.filter(form="quick")
TokenList<quick>
By using filter(field1__field2=value) you can filter based on subelements further down in a parsed token.
>>> sentence.filter(feats__Degree="Pos")
TokenList<quick, brown, lazy>
Filters can also be chained (meaning you can do sentence.filter(...).filter(...)), and filtering on multiple properties at the same time (sentence.filter(field1=value1, field2=value2)) means that ALL properties must match.
New in conllu 4.3: filter() a TokenList by lambda
You can also filter using a lambda function as value. This is useful if you, for instance, would like to filter out only tokens with integer ID:s:
>>> from conllu.models import TokenList, Token
>>> sentence2 = TokenList([
... Token(id=(1, "-", 2), form="It's"),
... Token(id=1, form="It"),
... Token(id=2, form="is"),
... ])
>>> sentence2
TokenList<It's, It, is>
>>> sentence2.filter(id=lambda x: type(x) is int)
TokenList<It, is>
Writing data back to a TokenList
If you want to change your CoNLL-U file, there are a couple of convenience methods to know about.
You can add a new token by simply appending a dictionary with the fields you want to a TokenList:
>>> sentence3 = TokenList([
... {"id": 1, "form": "Lazy"},
... {"id": 2, "form": "fox"},
... ])
>>> sentence3
TokenList<Lazy, fox>
>>> sentence3.append({"id": 3, "form": "box"})
>>> sentence3
TokenList<Lazy, fox, box>
Changing a sentence just means indexing into it, and setting a value to what you want:
>>> sentence4 = TokenList([
... {"id": 1, "form": "Lazy"},
... {"id": 2, "form": "fox"},
... ])
>>> sentence4[1]["form"] = "crocodile"
>>> sentence4
TokenList<Lazy, crocodile>
>>> sentence4[1] = {"id": 2, "form": "elephant"}
>>> sentence4
TokenList<Lazy, elephant>
If you omit a field when passing in a dict, conllu will fill in a "_" for those values.
>>> sentences = parse("1\tThe")
>>> sentences[0].append({"id": 2})
>>> sentences[0]
TokenList<The, _>
Parse metadata from a CoNLL-U file
Each sentence can also have metadata in the form of comments before the sentence starts. This is available in a property on the TokenList called metadata.
>>> sentence.metadata
{'text': 'The quick brown fox jumps over the lazy dog.'}
Turn a TokenList back into CoNLL-U
If you ever want to get your CoNLL-U formatted text back (maybe after changing something?), use the serialize() method:
>>> print(sentence.serialize())
# text = The quick brown fox jumps over the lazy dog.
1 The the DET DT Definite=Def|PronType=Art 4 det _ _
2 quick quick ADJ JJ Degree=Pos 4 amod _ _
3 brown brown ADJ JJ Degree=Pos 4 amod _ _
4 fox fox NOUN NN Number=Sing 5 nsubj _ _
5 jumps jump VERB VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 0 root _ _
6 over over ADP IN _ 9 case _ _
7 the the DET DT Definite=Def|PronType=Art 9 det _ _
8 lazy lazy ADJ JJ Degree=Pos 9 amod _ _
9 dog dog NOUN NN Number=Sing 5 nmod _ SpaceAfter=No
10 . . PUNCT . _ 5 punct _ _
Turn a TokenList into a TokenTree (see below)
You can also convert a TokenList to a TokenTree by using to_tree:
>>> sentence.to_tree()
TokenTree<token={id=5, form=jumps}, children=[...]>
That's it!
Use parse_tree() to parse into a list of dependency trees
Sometimes you're interested in the tree structure that hides in the head column of a CoNLL-U file. When this is the case, use parse_tree to get a nested structure representing the sentence.
>>> from conllu import parse_tree
>>> sentences = parse_tree(data)
>>> sentences
[TokenTree<...>]
Advanced usage: If you have many sentences to parse at once (say, over a megabyte), you can avoid loading them all into memory by using parse_tree_incr() instead of parse_tree.
