
XCFGs

This repo aims to unify all extensions of context-free grammars (XCFGs). X stands for weighted, (compound) probabilistic, neural, and other extensions. Currently, only the data preprocessing module has been implemented.

Update (08/06/2023): Support the Brown Corpus and the English Web Treebank, which are used in this study.

Update (06/02/2022): Parse MSCOCO and Flickr30k captions, create data splits, and encode images for VC-PCFG.

Update (03/10/2021): Parallel Chinese-English data is supported.

Data

The repo handles WSJ, CTB, SPMRL, Brown Corpus, and English Web Treebank. Have a look at treebank.py.

If you are looking for the data used in C-PCFGs, follow the instructions in treebank.py and put all outputs in the same folder, say ./data.punct. The script only removes morphology features and creates data splits. To remove punctuation, use clean_tb.py. For example, I ran python clean_tb.py ./data.punct ./data.clean, which writes all the cleaned treebanks to ./data.clean. Then run ./batchify.sh ./data.clean/ to produce all the data needed to reproduce the results in C-PCFGs. Feel free to change the parameters in batchify.sh if you want to use a different batch size or vocabulary size.
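To make the punctuation-removal step concrete, here is a minimal sketch of what stripping punctuation from a bracketed tree can look like. It is not the code in clean_tb.py: the function name strip_punct and the tag set PUNCT_TAGS are assumptions made for illustration, and the actual script may use a different tag list or handle more edge cases.

```python
import re

# Assumed PTB-style punctuation tags; the set used by clean_tb.py may differ.
PUNCT_TAGS = {",", ".", ":", "``", "''", "-LRB-", "-RRB-"}

LEAF = re.compile(r"\(([^()\s]+)\s+([^()\s]+)\)")  # matches "(TAG word)"
EMPTY = re.compile(r"\([^()\s]+\s*\)")             # matches "(LABEL )" with no children


def strip_punct(tree: str) -> str:
    """Remove punctuation leaves from a bracketed tree string (hypothetical helper)."""
    # Drop every "(TAG word)" leaf whose tag is a punctuation tag.
    tree = LEAF.sub(lambda m: "" if m.group(1) in PUNCT_TAGS else m.group(0), tree)
    # Repeatedly prune constituents that lost all of their children.
    prev = None
    while prev != tree:
        prev = tree
        tree = EMPTY.sub("", tree)
    # Tidy the whitespace left behind by the deletions.
    tree = re.sub(r"\s+\)", ")", tree)
    return re.sub(r"\s+", " ", tree).strip()


cleaned = strip_punct("(S (NP (DT The) (NN cat)) (VP (VBZ sleeps)) (. .))")
print(cleaned)
```

Pruning emptied constituents matters because a node that dominated only punctuation would otherwise survive as a label with no words under it.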

Evaluation

To ease evaluation I represent a gold tree as a tuple:

TREE: TUPLE(sentence: STR, spans: LIST[SPAN], span_labels: LIST[STR], pos_tags: LIST[STR])
SPAN: TUPLE(left_boundary: INT, right_boundary: INT)
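
As a rough illustration, the schema above could be written with Python type hints as follows. The names Tree, Span, and is_consistent are hypothetical, and treating the boundaries as inclusive, 0-based word indices is an assumption; the schema itself does not say which convention the repo uses.

```python
from typing import List, NamedTuple, Tuple

# (left_boundary, right_boundary); inclusive 0-based word indices are an assumption.
Span = Tuple[int, int]


class Tree(NamedTuple):
    sentence: str
    spans: List[Span]
    span_labels: List[str]
    pos_tags: List[str]


def is_consistent(tree: Tree) -> bool:
    """Check that labels align with spans, tags align with words, and spans are in range."""
    n = len(tree.sentence.split())
    return (
        len(tree.spans) == len(tree.span_labels)
        and len(tree.pos_tags) == n
        and all(0 <= left <= right < n for left, right in tree.spans)
    )


# Hand-built example: (S (NP (DT the) (NN cat)) (VP (VBZ sleeps)))
example = Tree(
    sentence="the cat sleeps",
    spans=[(0, 1), (0, 2)],
    span_labels=["NP", "S"],
    pos_tags=["DT", "NN", "VBZ"],
)
print(is_consistent(example))
```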

If you have followed the instructions in the previous section, running ./binarize.sh ./data.clean/ converts the gold trees into this tuple representation.
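
For readers who want to see what such a conversion involves, the sketch below turns one bracketed tree into the tuple form. It is not what binarize.sh runs: the function tree_to_tuple is hypothetical, it performs no binarization, and both the inclusive 0-based boundaries and the choice to skip single-word spans are assumptions.

```python
from typing import List, Tuple


def tree_to_tuple(bracketed: str) -> Tuple[str, List[Tuple[int, int]], List[str], List[str]]:
    """Convert "(S (NP (DT the) ...))" into (sentence, spans, span_labels, pos_tags)."""
    tokens = bracketed.replace("(", " ( ").replace(")", " ) ").split()
    words: List[str] = []
    pos_tags: List[str] = []
    spans: List[Tuple[int, int]] = []
    labels: List[str] = []
    stack: List[Tuple[str, int]] = []  # (label, index of the first word it covers)
    i = 0
    while i < len(tokens):
        if tokens[i] == "(":
            # A preterminal looks like "( TAG word )".
            if tokens[i + 2] != "(" and tokens[i + 3] == ")":
                pos_tags.append(tokens[i + 1])
                words.append(tokens[i + 2])
                i += 4
            else:
                stack.append((tokens[i + 1], len(words)))
                i += 2
        elif tokens[i] == ")":
            label, start = stack.pop()
            end = len(words) - 1
            if end > start:  # assumption: single-word spans are not recorded
                spans.append((start, end))
                labels.append(label)
            i += 1
        else:
            raise ValueError(f"unexpected token {tokens[i]!r} at position {i}")
    return " ".join(words), spans, labels, pos_tags


print(tree_to_tuple("(S (NP (DT the) (NN cat)) (VP (VBZ sleeps)))"))
```

Spans are emitted when a constituent closes, so children always appear before their parent in the output lists.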

Trivial baselines

Even for trivial baselines, e.g., left- and right-branching trees, you may find different F1 numbers in the grammar induction literature, partly because authors used (slightly) different data preprocessing procedures. To encourage truly fair comparison, I also released a standard procedure, baseline.py. Hopefully, this will help with the situation.

| Model | WSJ | CTB | Basque | German | French | Hebrew | Hungarian | Korean | Polish | Swedish |
|:-:|-:|-:|-:|-:|-:|-:|-:|-:|-:|-:|
| LB | 8.7 | 7.2 | 17.9 | 10.0 | 5.7 | 8.5 | 13.3 | 18.5 | 10.9 | 8.4 |
| RB | 39.5 | 25.5 | 15.4 | 14.7 | 26.4 | 30.0 | 12.7 | 19.2 | 34.2 | 30.4 |
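
For reference, the two baselines can be sketched in a few lines. This is an illustration under stated assumptions, not the procedure in baseline.py: the function names are hypothetical, spans use inclusive 0-based boundaries, and every multi-word span (including the whole-sentence span) is counted, which may differ from how baseline.py filters spans.

```python
from typing import Iterable, List, Tuple

Span = Tuple[int, int]


def left_branching(n: int) -> List[Span]:
    """Spans of ((w0 w1) w2) ...: every span starts at the first word."""
    return [(0, right) for right in range(1, n)]


def right_branching(n: int) -> List[Span]:
    """Spans of (w0 (w1 (w2 ...))): every span ends at the last word."""
    return [(left, n - 1) for left in range(n - 1)]


def span_f1(pred: Iterable[Span], gold: Iterable[Span]) -> float:
    """Unlabeled F1 between two span sets for a single sentence."""
    pred_set, gold_set = set(pred), set(gold)
    overlap = len(pred_set & gold_set)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_set)
    recall = overlap / len(gold_set)
    return 2 * precision * recall / (precision + recall)


# Gold spans for "the cat chased the mouse": NP, NP, VP, S.
gold = [(0, 1), (3, 4), (2, 4), (0, 4)]
print(span_f1(left_branching(5), gold), span_f1(right_branching(5), gold))
```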

An evaluation checklist for phrase-structure grammar induction

Below is a comparison of several critical training / evaluation settings of recent unsupervised parsing models.

| Model | Sent. F1 | Corpus F1 | Variance | Word repr. | Punct. rm | Length | Dataset |
|:-:|-:|-:|-:|-:|-:|-:|-:|
| PRPN | ✓ | | | RAW | ✓ | | WSJ |
| ON | ✓ | | | RAW | ✓ | | WSJ |
| DIORA | ✓ | | | ELMo | | | WSJ |
| URNNG | ✓ | | | RAW | ✗ | | WSJ |
| N-PCFG | ✓ | | | RAW | ✓ | | WSJ / CTB |
| C-PCFG | ✓ | | | RAW | ✓ | | WSJ / CTB |
| VG-NSL | ✓ | | ✓ | RAW / FastText | ✗ | | MSCOCO |
| LN-PCFG | ✓ | | | RAW | | | WSJ |
| CT | ✓ | | | RoBERTa | | | WSJ |
| S-DIORA | ✓ | | | ELMo | | | WSJ |
| VC-PCFG | ✓ | ✓ | ✓ | RAW | ✓ | | MSCOCO |
| C-PCFG (Zhao 2020) | ✓ | ✓ | ✓ | RAW | ✓ | | WSJ / CTB / SPMRL |
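
The Sent. F1 and Corpus F1 columns refer to two ways of aggregating span overlap, and the two can give noticeably different numbers on the same predictions. The sketch below shows the usual definitions: sentence-level F1 averages per-sentence scores, while corpus-level F1 pools the counts first. The function names are hypothetical, and the exact conventions of each listed model (for example, which spans are excluded) are not reproduced here.

```python
from typing import List, Sequence, Tuple

Span = Tuple[int, int]


def f1_from_counts(overlap: int, n_pred: int, n_gold: int) -> float:
    """F1 from matched, predicted, and gold span counts."""
    if overlap == 0:
        return 0.0
    precision = overlap / n_pred
    recall = overlap / n_gold
    return 2 * precision * recall / (precision + recall)


def sentence_f1(preds: Sequence[List[Span]], golds: Sequence[List[Span]]) -> float:
    """Average of per-sentence F1 scores: every sentence weighs the same."""
    scores = [
        f1_from_counts(len(set(p) & set(g)), len(set(p)), len(set(g)))
        for p, g in zip(preds, golds)
    ]
    return sum(scores) / len(scores)


def corpus_f1(preds: Sequence[List[Span]], golds: Sequence[List[Span]]) -> float:
    """F1 over pooled counts: longer sentences contribute more spans."""
    overlap = sum(len(set(p) & set(g)) for p, g in zip(preds, golds))
    n_pred = sum(len(set(p)) for p in preds)
    n_gold = sum(len(set(g)) for g in golds)
    return f1_from_counts(overlap, n_pred, n_gold)


# One short sentence parsed perfectly, one longer sentence parsed poorly.
preds = [[(0, 1)], [(0, 1), (0, 2), (0, 3)]]
golds = [[(0, 1)], [(1, 3), (2, 3), (0, 3)]]
print(sentence_f1(preds, golds), corpus_f1(preds, golds))
```

In this toy case the sentence-level score is 2/3 while the corpus-level score is 1/2, which is why it helps to state which one a paper reports.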

Citing XCFGs

If you use XCFGs in your research or wish to refer to the results in C-PCFGs, please use the following BibTeX entries.

@inproceedings{zhao-titov-2023-transferability,
    title = "On the Transferability of Visually Grounded {PCFGs}",
    author = "Zhao, Yanpeng  and Titov, Ivan",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2023",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
}

@inproceedings{zhao-titov-2021-empirical,
    title = "An Empirical Study of Compound {PCFG}s",
    author = "Zhao, Yanpeng and Titov, Ivan",
    booktitle = "Proceedings of the Second Workshop on Domain Adaptation for NLP",
    month = apr,
    year = "2021",
    address = "Kyiv, Ukraine",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.adaptnlp-1.17",
    pages = "166--171",
}

Acknowledgements

batchify.py is borrowed from C-PCFGs.

License

MIT
