Xcfg
X (weighted / probabilistic) Context-Free Grammars
XCFGs
XCFGs aims to unify all extensions of context-free grammars. X stands for weighted, (compound) probabilistic, neural, and other extensions. Currently, only the data preprocessing module has been implemented.
Update (08/06/2023): Support Brown Corpus and English Web Treebank that are used in this study.
Update (06/02/2022): Parse MSCOCO and Flickr30k captions, create data splits, and encode images for VC-PCFG.
Update (03/10/2021): Parallel Chinese-English data is supported.
Data
The repo handles WSJ, CTB, SPMRL, Brown Corpus, and English Web Treebank. Have a look at treebank.py.
If you are looking for the data used in C-PCFGs, follow the instructions in treebank.py and put all outputs in the same folder, say ./data.punct. The script only removes morphology features and creates data splits. To remove punctuation, use clean_tb.py; for example, I used python clean_tb.py ./data.punct ./data.clean. All the cleaned treebanks will then reside in ./data.clean. Finally, run ./batchify.sh ./data.clean/ to produce all the data needed to reproduce the results in C-PCFGs. Feel free to change the parameters in batchify.sh if you want to use a different batch size or vocabulary size.
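To make the punctuation-removal step concrete, here is a minimal sketch of what such a pass might look like on a single bracketed tree. It is not the code in clean_tb.py: the PUNCT_TAGS set, the function names, and the choice to prune constituents left empty are all assumptions made for illustration.

```python
# Hypothetical sketch: drop punctuation preterminals from a bracketed tree
# and prune any constituent that becomes empty as a result.
PUNCT_TAGS = {",", ".", ":", "``", "''", "-LRB-", "-RRB-"}  # assumed PTB-style tag set


def parse(tokens, pos=0):
    """Parse the constituent starting at tokens[pos] == '('; return (node, next_pos)."""
    label = tokens[pos + 1]
    pos += 2
    children = []
    while tokens[pos] != ")":
        if tokens[pos] == "(":
            child, pos = parse(tokens, pos)
        else:
            child, pos = tokens[pos], pos + 1  # a word
        children.append(child)
    return [label, children], pos + 1


def strip_punct(node):
    """Return the node without punctuation, or None if nothing is left."""
    label, children = node
    if all(isinstance(c, str) for c in children):  # preterminal: (TAG word)
        return None if label in PUNCT_TAGS else node
    kept = [c for c in (strip_punct(ch) for ch in children) if c is not None]
    return [label, kept] if kept else None


def to_string(node):
    """Serialize a node back into bracketed form."""
    label, children = node
    parts = [c if isinstance(c, str) else to_string(c) for c in children]
    return "(" + label + " " + " ".join(parts) + ")"


def remove_punct(bracketed):
    tokens = bracketed.replace("(", " ( ").replace(")", " ) ").split()
    tree, _ = parse(tokens)
    cleaned = strip_punct(tree)
    return to_string(cleaned) if cleaned is not None else ""


print(remove_punct("(S (NP (DT the) (NN cat)) (VP (VBZ sleeps)) (. .))"))
```

Filtering by POS tag rather than by word form keeps tokens such as abbreviations intact, but which tags count as punctuation differs across treebanks, so the set above would need adjusting per corpus.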
Evaluation
To ease evaluation I represent a gold tree as a tuple:
TREE: TUPLE(sentence: STR, spans: LIST[SPAN], span_labels: LIST[STR], pos_tags: LIST[STR])
SPAN: TUPLE(left_boundary: INT, right_boundary: INT)
If you have followed the instructions in the previous section, running ./binarize.sh ./data.clean/ converts the gold trees into this tuple representation.
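As an illustration of the tuple representation, the sketch below converts one bracketed tree into (sentence, spans, span_labels, pos_tags). It is not the repo's conversion code: tree_to_tuple is a hypothetical helper, and it assumes spans use inclusive, zero-based word indices and that every constituent (including single-word ones) is kept.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # assumed: inclusive (left_boundary, right_boundary)
Tree = Tuple[str, List[Span], List[str], List[str]]


def tree_to_tuple(bracketed: str) -> Tree:
    """Convert e.g. "(S (NP (DT the) (NN cat)) (VP (VBZ sleeps)))" into a TREE tuple."""
    tokens = bracketed.replace("(", " ( ").replace(")", " ) ").split()
    words: List[str] = []
    tags: List[str] = []
    spans: List[Span] = []
    labels: List[str] = []
    stack = []  # open constituents as (label, index of their first word)
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "(":
            label = tokens[i + 1]
            if tokens[i + 2] != "(":  # preterminal: ( TAG word )
                tags.append(label)
                words.append(tokens[i + 2])
                i += 4
            else:  # phrasal node: remember where it starts
                stack.append((label, len(words)))
                i += 2
        elif tok == ")":  # close the most recent phrasal node
            label, start = stack.pop()
            spans.append((start, len(words) - 1))
            labels.append(label)
            i += 1
        else:
            raise ValueError("unexpected token: " + tok)
    return " ".join(words), spans, labels, tags


sentence, spans, span_labels, pos_tags = tree_to_tuple(
    "(S (NP (DT the) (NN cat)) (VP (VBZ sleeps)))"
)
print(sentence, spans, span_labels, pos_tags)
```

Spans are emitted in the order their constituents close, so children precede their parents and the i-th label always belongs to the i-th span.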
Trivial baselines
Even for trivial baselines, e.g., left- and right-branching trees, you may find different F1 numbers in the grammar induction literature, partly because authors have used (slightly) different data preprocessing procedures. To encourage truly fair comparison, I have also released a standard procedure, baseline.py. Hopefully, this will help with the situation.
| Model | WSJ | CTB | Basque | German | French | Hebrew | Hungarian | Korean | Polish | Swedish |
|:-:|-:|-:|-:|-:|-:|-:|-:|-:|-:|-:|
| LB | 8.7 | 7.2 | 17.9 | 10.0 | 5.7 | 8.5 | 13.3 | 18.5 | 10.9 | 8.4 |
| RB | 39.5 | 25.5 | 15.4 | 14.7 | 26.4 | 30.0 | 12.7 | 19.2 | 34.2 | 30.4 |
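For readers unfamiliar with these baselines, the following sketch shows how left- and right-branching span sets can be written down for a sentence of n words and scored against gold spans. It is not baseline.py and will not reproduce the numbers in the table: the function names, the inclusive span convention, and the toy gold tree are assumptions for illustration only.

```python
from typing import Iterable, List, Tuple

Span = Tuple[int, int]  # assumed: inclusive, zero-based (left, right) word indices


def left_branching(n: int) -> List[Span]:
    """Spans of a left-branching tree over n words: (0, 1), (0, 2), ..., (0, n - 1)."""
    return [(0, j) for j in range(1, n)]


def right_branching(n: int) -> List[Span]:
    """Spans of a right-branching tree over n words: (0, n - 1), (1, n - 1), ..."""
    return [(i, n - 1) for i in range(n - 1)]


def span_f1(pred: Iterable[Span], gold: Iterable[Span]) -> float:
    """Unlabeled F1 between two span sets; defined as 0.0 when nothing matches."""
    pred_set, gold_set = set(pred), set(gold)
    matched = len(pred_set & gold_set)
    if matched == 0:
        return 0.0
    precision = matched / len(pred_set)
    recall = matched / len(gold_set)
    return 2 * precision * recall / (precision + recall)


# Toy gold tree (a ((b c) d)) over four words.
gold = [(0, 3), (1, 3), (1, 2)]
print("LB F1:", span_f1(left_branching(4), gold))
print("RB F1:", span_f1(right_branching(4), gold))
```

Note that this sketch counts the whole-sentence span, which every tree contains. Evaluation scripts often discard it along with single-word spans, and choices like this are one reason reported baseline numbers differ.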
An evaluation checklist for phrase-structure grammar induction
Below is a comparison of several critical training / evaluation settings of recent unsupervised parsing models.
| Model | Sent. F1 | Corpus F1 | Variance | Word repr. | Punct. rm | Length | Dataset |
|:-:|-:|-:|-:|-:|-:|-:|-:|
| PRPN | ✓ | | | RAW | ✓ | | WSJ |
| ON | ✓ | | | RAW | ✓ | | WSJ |
| DIORA | ✓ | | | ELMo | | | WSJ |
| URNNG | ✓ | | | RAW | ✗ | | WSJ |
| N-PCFG | ✓ | | | RAW | ✓ | | WSJ / CTB |
| C-PCFG | ✓ | | | RAW | ✓ | | WSJ / CTB |
| VG-NSL | ✓ | | ✓ | RAW / FastText | ✗ | | MSCOCO |
| LN-PCFG | ✓ | | | RAW | | | WSJ |
| CT | ✓ | | | RoBERTa | | | WSJ |
| S-DIORA | ✓ | | | ELMo | | | WSJ |
| VC-PCFG | ✓ | ✓ | ✓ | RAW | ✓ | | MSCOCO |
| C-PCFG (Zhao 2020) | ✓ | ✓ | ✓ | RAW | ✓ | | WSJ / CTB / SPMRL |
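The Sent. F1 and Corpus F1 columns refer to two ways of aggregating span-level scores, and they can differ noticeably on the same predictions. The sketch below contrasts them on toy data. It is an illustration under assumptions, not the evaluation code of any model in the table: the function names are hypothetical, spans are assumed to be inclusive (left, right) pairs, and a sentence with no matching spans is scored as 0.0.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # assumed: inclusive (left, right) word indices


def f1_from_counts(matched: int, n_pred: int, n_gold: int) -> float:
    """F1 from raw counts; 0.0 when there is no match."""
    if matched == 0:
        return 0.0
    precision = matched / n_pred
    recall = matched / n_gold
    return 2 * precision * recall / (precision + recall)


def sentence_f1(preds: List[List[Span]], golds: List[List[Span]]) -> float:
    """Sentence-level F1: compute F1 per sentence, then average over sentences."""
    scores = [
        f1_from_counts(len(set(p) & set(g)), len(set(p)), len(set(g)))
        for p, g in zip(preds, golds)
    ]
    return sum(scores) / len(scores)


def corpus_f1(preds: List[List[Span]], golds: List[List[Span]]) -> float:
    """Corpus-level F1: pool counts over all sentences, then compute one F1."""
    matched = sum(len(set(p) & set(g)) for p, g in zip(preds, golds))
    n_pred = sum(len(set(p)) for p in preds)
    n_gold = sum(len(set(g)) for g in golds)
    return f1_from_counts(matched, n_pred, n_gold)


# Toy data: a short sentence parsed perfectly, a longer one parsed partially.
preds = [[(0, 1)], [(0, 1), (2, 4), (0, 2)]]
golds = [[(0, 1)], [(0, 1), (3, 4), (2, 4), (1, 4)]]
print("sentence-level F1:", sentence_f1(preds, golds))
print("corpus-level F1:  ", corpus_f1(preds, golds))
```

Sentence-level averaging gives every sentence equal weight, so short, easy sentences pull the score up, while corpus-level pooling weights sentences by their number of spans. On the toy data the two scores differ by more than ten points, which is why the checklist tracks them separately.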
Citing XCFGs
If you use XCFGs in your research or wish to refer to the results in C-PCFGs, please use the following BibTeX entries.
@inproceedings{zhao-titov-2023-transferability,
title = "On the Transferability of Visually Grounded {PCFGs}",
author = "Zhao, Yanpeng and Titov, Ivan",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2023",
month = dec,
year = "2023",
address = "Singapore",
publisher = "Association for Computational Linguistics",
}
@inproceedings{zhao-titov-2021-empirical,
title = "An Empirical Study of Compound {PCFG}s",
author = "Zhao, Yanpeng and Titov, Ivan",
booktitle = "Proceedings of the Second Workshop on Domain Adaptation for NLP",
month = apr,
year = "2021",
address = "Kyiv, Ukraine",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.adaptnlp-1.17",
pages = "166--171",
}
Acknowledgements
batchify.py is borrowed from C-PCFGs.
License
MIT
