CoNLL-U Parser
CoNLL-U Parser parses a CoNLL-U formatted string into a nested python dictionary. CoNLL-U is often the output of natural language processing tasks.
Why should you use conllu?
- It's simple: ~300 lines of code.
- It has no dependencies.
- It has full typing support, so your editor can do autocompletion.
- It has a nice set of tests with a CI setup.
- It has 100% test branch coverage (and has undergone mutation testing).
Installation
Note: As of conllu 5.0, Python 3.8 is required to install conllu. See Notes on updating from 4.0 to 5.0
pip install conllu
Or, if you are using conda:
conda install -c conda-forge conllu
Notes on updating from 5.0 to 6.0
Conllu version 6.0 drops support for one method from the public API: parse_conllu_plus_fields. This is no longer needed as we have refactored how fields are read. You likely didn't use this function, but this was part of the public API, so I'm releasing a new major version.
Notes on updating from 4.0 to 5.0
Conllu version 5.0 drops support for Python 3.6 and 3.7 and requires Python 3.8 at a minimum. If you need support for older versions of python, you can always pin your install to an old version of conllu. You can install it with pip install conllu==4.5.3.
Notes on updating from 3.0 to 4.0
Conllu version 4.0 drops support for Python 2 and all versions earlier than Python 3.6. If you need support for older versions of python, you can always pin your install to an old version of conllu. You can install it with pip install conllu==3.1.1.
Notes on updating from 2.0 to 3.0
The Universal Dependencies 2.0 release renamed two of the fields: xpostag -> xpos and upostag -> upos. Version 3.0 of conllu handles this by aliasing the previous names to the new names. This means you can use xpos/upos or xpostag/upostag; they will both return the same thing. This does change the public API slightly, so I've upped the major version to 3.0, but I've taken care to ensure you most likely DO NOT have to update your code when you update to 3.0.
Notes on updating from 0.1 to 1.0
I don't like breaking backwards compatibility, but to be able to add new features I felt I had to. This means that updating from 0.1 to 1.0 might require code changes. Here's a guide on how to upgrade to 1.0.
Example usage
At the top level, conllu provides two methods: parse, which parses sentences into a flat list, and parse_tree, which returns a nested tree structure. Let's go through them one by one.
Use parse() to parse into a list of sentences
>>> from conllu import parse
>>>
>>> data = """
... # text = The quick brown fox jumps over the lazy dog.
... 1 The the DET DT Definite=Def|PronType=Art 4 det _ _
... 2 quick quick ADJ JJ Degree=Pos 4 amod _ _
... 3 brown brown ADJ JJ Degree=Pos 4 amod _ _
... 4 fox fox NOUN NN Number=Sing 5 nsubj _ _
... 5 jumps jump VERB VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 0 root _ _
... 6 over over ADP IN _ 9 case _ _
... 7 the the DET DT Definite=Def|PronType=Art 9 det _ _
... 8 lazy lazy ADJ JJ Degree=Pos 9 amod _ _
... 9 dog dog NOUN NN Number=Sing 5 nmod _ SpaceAfter=No
... 10 . . PUNCT . _ 5 punct _ _
...
... """
Now you have the data in a variable called data. Let's parse it:
>>> sentences = parse(data)
>>> sentences
[TokenList<The, quick, brown, fox, jumps, over, the, lazy, dog, ., metadata={text: "The quick brown fox jumps over the lazy dog."}>]
Advanced usage: If you have many sentences to parse at once (say, over a megabyte), you can avoid loading them all into memory by using parse_incr() instead of parse. It takes an opened file and returns a generator instead of a list, so you need to either iterate over it or call list() on it to get the TokenLists out. Here's how you would use it:
from io import open
from conllu import parse_incr
data_file = open("huge_file.conllu", "r", encoding="utf-8")
for tokenlist in parse_incr(data_file):
    print(tokenlist)
For most files, parse works fine.
Since one CoNLL-U file usually contains multiple sentences, parse() always returns a list of sentences. Each sentence is represented by a TokenList.
>>> sentence = sentences[0]
>>> sentence
TokenList<The, quick, brown, fox, jumps, over, the, lazy, dog, ., metadata={text: "The quick brown fox jumps over the lazy dog."}>
The TokenList supports indexing, so you can get the first token, represented by a Token (a dict subclass), like this:
>>> token = sentence[0]
>>> token
{'id': 1,
'form': 'The',
'lemma': 'the',
...}
>>> token["form"]
'The'
New in conllu 2.0: filter() a TokenList
>>> sentence = sentences[0]
>>> sentence
TokenList<The, quick, brown, fox, jumps, over, the, lazy, dog, ., metadata={text: "The quick brown fox jumps over the lazy dog."}>
>>> sentence.filter(form="quick")
TokenList<quick>
By using filter(field1__field2=value) you can filter based on subelements further down in a parsed token.
>>> sentence.filter(feats__Degree="Pos")
TokenList<quick, brown, lazy>
Filters can also be chained (meaning you can do sentence.filter(...).filter(...)), and filtering on multiple properties at the same time (sentence.filter(field1=value1, field2=value2)) means that ALL properties must match.
New in conllu 4.3: filter() a TokenList by lambda
You can also filter using a lambda function as value. This is useful if you, for instance, would like to filter out only tokens with integer ID:s:
>>> from conllu.models import TokenList, Token
>>> sentence2 = TokenList([
... Token(id=(1, "-", 2), form="It's"),
... Token(id=1, form="It"),
... Token(id=2, form="is"),
... ])
>>> sentence2
TokenList<It's, It, is>
>>> sentence2.filter(id=lambda x: type(x) is int)
TokenList<It, is>
Writing data back to a TokenList
If you want to change your CoNLL-U file, there are a couple of convenience methods to know about.
You can add a new token by simply appending a dictionary with the fields you want to a TokenList:
>>> sentence3 = TokenList([
... {"id": 1, "form": "Lazy"},
... {"id": 2, "form": "fox"},
... ])
>>> sentence3
TokenList<Lazy, fox>
>>> sentence3.append({"id": 3, "form": "box"})
>>> sentence3
TokenList<Lazy, fox, box>
Changing a sentence just means indexing into it, and setting a value to what you want:
>>> sentence4 = TokenList([
... {"id": 1, "form": "Lazy"},
... {"id": 2, "form": "fox"},
... ])
>>> sentence4[1]["form"] = "crocodile"
>>> sentence4
TokenList<Lazy, crocodile>
>>> sentence4[1] = {"id": 2, "form": "elephant"}
>>> sentence4
TokenList<Lazy, elephant>
If you omit a field when passing in a dict, conllu will fill in a "_" for those values.
>>> sentences = parse("1\tThe")
>>> sentences[0].append({"id": 2})
>>> sentences[0]
TokenList<The, _>
Parse metadata from a CoNLL-U file
Each sentence can also have metadata in the form of comments before the sentence starts. This is available in a property on the TokenList called metadata.
>>> sentence.metadata
{'text': 'The quick brown fox jumps over the lazy dog.'}
Turn a TokenList back into CoNLL-U
If you ever want to get your CoNLL-U formatted text back (maybe after changing something?), use the serialize() method:
>>> print(sentence.serialize())
# text = The quick brown fox jumps over the lazy dog.
1 The the DET DT Definite=Def|PronType=Art 4 det _ _
2 quick quick ADJ JJ Degree=Pos 4 amod _ _
3 brown brown ADJ JJ Degree=Pos 4 amod _ _
4 fox fox NOUN NN Number=Sing 5 nsubj _ _
5 jumps jump VERB VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 0 root _ _
6 over over ADP IN _ 9 case _ _
7 the the DET DT Definite=Def|PronType=Art 9 det _ _
8 lazy lazy ADJ JJ Degree=Pos 9 amod _ _
9 dog dog NOUN NN Number=Sing 5 nmod _ SpaceAfter=No
10 . . PUNCT . _ 5 punct _ _
Turn a TokenList into a TokenTree (see below)
You can also convert a TokenList to a TokenTree by using to_tree:
>>> sentence.to_tree()
TokenTree<token={id=5, form=jumps}, children=[...]>
That's it!
Use parse_tree() to parse into a list of dependency trees
Sometimes you're interested in the tree structure that hides in the head column of a CoNLL-U file. When this is the case, use parse_tree to get a nested structure representing the sentence.
>>> from conllu import parse_tree
>>> sentences = parse_tree(data)
>>> sentences
[TokenTree<...>]
Advanced usage: If you have many sentences to parse at once (say, over a megabyte), you can avoid loading them all into memory by using parse_tree_incr() instead of parse_tree.
