CrossSemparse
Code for the paper: "Cross-domain Semantic Parsing via Paraphrasing" - EMNLP 2017
Cross-Domain Semantic Parsing / Natural Language Interface
The Cold Start Problem
Semantic parsing, which maps natural language utterances into computer-understandable logical forms, has drawn substantial attention recently as a promising direction for developing natural language interfaces to computers. Natural language interfaces can be built for many domains (healthcare, finance, IoT, sports, etc.), which makes portability and scalability a pressing challenge. In other words, this is the cold start problem of natural language interfaces:
<p align="center"><i>Given a new domain, how can we build a natural language interface for it?</i></p>
<p align="center"> <img align="center" src="misc/cold_start.jpg" alt="Cold Start Problem" width="500px"/> </p>

Solution
There are three complementary ways to solve the cold start problem:
<p align="center"> <img align="center" src="misc/cold_start_nli.png" alt="Cold Start Solution" width="500px"/> </p>

- Re-use the training data for some existing domains via transfer learning (this repo)
- Collect training data for the new domain via crowdsourcing [1] [2]
- Once a natural language interface has been cold-started with reasonable performance, develop a user-friendly interaction mechanism, deploy the system, and let it interact with real users so that it keeps refining itself [3] [4]
Use of This Repo
Requirements
- Python 2.7
- TensorFlow 0.11 (yes, the TF version is a bit old, but it still works reasonably well)
- PyYAML (for logging)
Setup
Install Tensorflow 0.11:
(GPU support)
export TF_BINARY_URL=https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow-0.11.0-cp27-none-linux_x86_64.whl
pip install --ignore-installed --upgrade $TF_BINARY_URL
(CPU only)
export TF_BINARY_URL=https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.11.0-cp27-none-linux_x86_64.whl
pip install --ignore-installed --upgrade $TF_BINARY_URL
Install other dependencies:
pip install pyyaml
Training
We use the Overnight dataset, which contains 8 domains including Basketball, Calendar, and Restaurants. The dataset is already pre-processed and can be found under data.
Assume we are at the root of the repo. All training and testing can be done with the following command:
sh scripts/batch_run.sh 0 train_grid_unit_var overnight 0
The arguments are:
- GPU ID: which GPU to use for this run
- Training Script: each word embedding initialization has a separate script
- Dataset: for now, the only option is overnight
- Execution Number: a unique number for this execution. A corresponding directory will be created under execs/ to host the trained model and the log of this execution.
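As a minimal sketch of how the four positional arguments line up, the helper below assembles the command shown above as an argument list. The function name `build_batch_run_command` is hypothetical and not part of this repo; it only builds the list and does not launch training.

```python
def build_batch_run_command(gpu_id, training_script, dataset, exec_num):
    """Return the argv list for scripts/batch_run.sh.

    Hypothetical helper for illustration; it is not shipped with the repo.
    """
    if dataset != "overnight":
        # This README states that overnight is currently the only dataset option.
        raise ValueError("unsupported dataset: %s" % dataset)
    return [
        "sh",
        "scripts/batch_run.sh",
        str(gpu_id),        # GPU ID
        training_script,    # Training Script, e.g. train_grid_unit_var
        dataset,            # Dataset
        str(exec_num),      # Execution Number
    ]


command = build_batch_run_command(0, "train_grid_unit_var", "overnight", 0)
print(" ".join(command))
```

The resulting list could be passed to `subprocess.check_call` from the root of the repo, which avoids shell quoting issues when the arguments come from another script.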
The command will do the following tasks for each of the 8 domains:
- In-domain: Train and test
- In-domain: Re-train with the full training data (i.e., training+validation) and then test (final results for in-domain setting)
- Cross-domain: Pre-training on the source domains
- Cross-domain: Warm-start on target domain with pre-trained model, fine-tune with in-domain data, and test
- Cross-domain: Re-train with full in-domain training data and then test (final results for cross-domain setting)
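The cross-domain steps above treat each domain in turn as the target. The sketch below illustrates that split under an assumption this README does not spell out: that the source domains for a given target are all the remaining domains. The function name is hypothetical, and the domain list is a placeholder containing only the three domains named in this README, not all 8.

```python
def source_target_splits(domains):
    """Yield (target, sources) pairs, holding out one domain at a time.

    Assumption for illustration: every non-target domain is a source domain.
    """
    for target in domains:
        sources = [d for d in domains if d != target]
        yield target, sources


# Placeholder subset: only the domains mentioned in this README.
domains = ["basketball", "calendar", "restaurants"]

for target, sources in source_target_splits(domains):
    print("pre-train on %s; fine-tune and test on %s" % (", ".join(sources), target))
```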
To train with another word embedding initialization strategy, e.g., the original word2vec embeddings without standardization, just change the training script and execution number:
sh scripts/batch_run.sh 0 train_grid_original overnight 1
Extract Testing Results
We provide a script to make it easy to extract the testing results across all of the domains. For example,
In-domain, exec_num=0, re-training with full training data:
python scripts/extract_test_result.py in-domain overnight 0 1
In-domain, exec_num=0, no re-training:
python scripts/extract_test_result.py in-domain overnight 0 0
Cross-domain, exec_num=5, re-training with full training data:
python scripts/extract_test_result.py cross-domain overnight 5 1
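Judging from the examples above, the script takes the setting, the dataset, the execution number, and a 0/1 re-training flag, in that order. The sketch below builds those command lines for both settings; the helper name is hypothetical, and the argument interpretation is inferred from the examples rather than from the script itself.

```python
def extract_command(setting, exec_num, retrain, dataset="overnight"):
    """Build the command line for scripts/extract_test_result.py.

    Hypothetical helper; argument order is inferred from the README examples.
    """
    if setting not in ("in-domain", "cross-domain"):
        raise ValueError("unknown setting: %s" % setting)
    # retrain is emitted as 1 (re-trained with full training data) or 0 (no re-training).
    return "python scripts/extract_test_result.py %s %s %d %d" % (
        setting, dataset, exec_num, int(bool(retrain)))


for setting in ("in-domain", "cross-domain"):
    for retrain in (1, 0):
        print(extract_command(setting, 0, retrain))
```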
References
Please refer to the following paper for more details. If you find it useful, please consider citing it:
@InProceedings {su2017cross,
author = {Su, Yu and Yan, Xifeng},
title = {Cross-domain Semantic Parsing via Paraphrasing},
booktitle = {Proceedings of the Conference on Empirical Methods in Natural Language Processing},
pages = {1235--1246},
year = {2017},
address = {Copenhagen, Denmark},
month = {Sept},
publisher = {Association for Computational Linguistics}
}
Other references for cold-starting a natural language interface
<a name="reference1"></a>[1] Yu Su, Ahmed Hassan Awadallah, Madian Khabsa, Patrick Pantel, Michael Gamon, Mark Encarnacion. Building Natural Language Interfaces to Web APIs. CIKM 2017.
<a name="reference2"></a>[2] Yu Su, Huan Sun, Brian Sadler, Mudhakar Srivatsa, Izzeddin Gur, Zenghui Yan, Xifeng Yan. On Generating Characteristic-rich Question Sets for QA Evaluation. EMNLP 2016.
<a name="reference3"></a>[3] Izzeddin Gur, Semih Yavuz, Yu Su, Xifeng Yan. DialSQL: Dialogue Based Structured Query Generation. ACL 2018.
<a name="reference4"></a>[4] Yu Su, Ahmed Hassan Awadallah, Miaosen Wang, Ryen White. Natural Language Interfaces with Fine-Grained User Interaction: A Case Study on Web APIs. SIGIR 2018.
