FOBIE

FOBIE dataset and code for Semi-Open Relation Extraction, applied to Biology for Computer-Aided Biomimetics.

Semi-Open Relation Extraction

The Focused Open Biology Information Extraction (FOBIE) dataset aims to support information extraction (IE) for Computer-Aided Biomimetics. The dataset contains ~1,500 sentences from scientific biological texts. These sentences are annotated with TRADE-OFFS and syntactically similar relations between unbounded arguments, as well as argument-modifiers.

The FOBIE dataset has been used to explore Semi-Open Relation Extraction (SORE). The code and instructions for this can be found in the Readme.md inside the SORE folder, or in the ReadTheDocs documentation.

Format

The train/dev/test data files are provided in two formats. The first is a verbose JSON format inspired by the SemEval-2018 Task 7 dataset:

{"[document_ID]":
  {"[relation_ID_within_document]":
    {"annotations":
      {"modifiers":
        {"[within_sentence_modifier_ID]":
          {"Arg0": {"span_start": "[token_index]",
                    "span_end": "[token_index]",
                    "span_id": "[brat_ID]",
                    "text": "[string]"},
           "Arg1": {"span_start": "[token_index]",
                    "span_end": "[token_index]",
                    "span_id": "[brat_ID]",
                    "text": "[string]"}
          }
        },
       "tradeoffs":
        {"[within_sentence_tradeoff_ID]":
          {"Arg0": {"span_start": "[token_index]",
                    "span_end": "[token_index]",
                    "span_id": "[brat_ID]",
                    "text": "[string]"},
           "Arg1": {"span_start": "[token_index]",
                    "span_end": "[token_index]",
                    "span_id": "[brat_ID]",
                    "text": "[string]"},
           "TO_indicator": {"span_start": "[token_index]",
                            "span_end": "[token_index]",
                            "span_id": "[brat_ID]",
                            "text": "[string]"},
           "labels": {"Confidence": "High"}
          }
        }
      },
     "sentence": "[string]"
    }
  }
}
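As an illustration of how this nesting can be traversed, here is a minimal Python sketch that collects the trade-off annotations from a dictionary in the verbose format. The names `TradeOff`, `iter_tradeoffs` and `sample` are hypothetical and not part of the repository, and the sample sentence is made up rather than taken from the dataset.

```python
from typing import Iterator, NamedTuple


class TradeOff(NamedTuple):
    doc_id: str
    indicator: str
    arg0: str
    arg1: str
    confidence: str


def iter_tradeoffs(data: dict) -> Iterator[TradeOff]:
    """Yield one record per trade-off annotation in a verbose-format dict."""
    for doc_id, relations in data.items():
        for relation in relations.values():
            # Either annotation group may be absent, so fall back to {}.
            tradeoffs = relation.get("annotations", {}).get("tradeoffs", {})
            for annotation in tradeoffs.values():
                yield TradeOff(
                    doc_id=doc_id,
                    indicator=annotation["TO_indicator"]["text"],
                    arg0=annotation["Arg0"]["text"],
                    arg1=annotation["Arg1"]["text"],
                    confidence=annotation.get("labels", {}).get("Confidence", ""),
                )


# Made-up sample that follows the schema above (span fields omitted for brevity).
sample = {
    "doc_1": {
        "0": {
            "annotations": {
                "modifiers": {},
                "tradeoffs": {
                    "0": {
                        "Arg0": {"text": "Fast growth"},
                        "Arg1": {"text": "low survival"},
                        "TO_indicator": {"text": "at the cost of"},
                        "labels": {"Confidence": "High"},
                    }
                },
            },
            "sentence": "Fast growth comes at the cost of low survival .",
        }
    }
}

results = list(iter_tradeoffs(sample))
for tradeoff in results:
    print(tradeoff.indicator, "|", tradeoff.arg0, "|", tradeoff.arg1)
```

For a real data file, the dictionary would typically be obtained with `json.load` on the opened file instead of the inline sample.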

The second is the SciERC dataset format, which is used to train the SciIE system:

{   "clusters": [],
    "sentences": [["List", "of", "some", "tokens", "."]],
    "ner": [[[4, 4, "Generic"]]],
    "relations": [[[4, 4, 6, 17, "Tradeoff"]]],
    "doc_key": "XXX"}
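To make the span encoding concrete, the following Python sketch resolves the relation spans of one record back to text. It assumes the usual SciERC conventions: one JSON object per line, token indices counted from the start of the document, and inclusive end indices. The function name `resolve_relations` and the sample record are hypothetical; the sample is made up and the order of spans within a relation is only illustrative.

```python
import json
from typing import List, Tuple


def resolve_relations(record: dict) -> List[Tuple[str, str, str]]:
    """Return (first_span_text, second_span_text, label) for each relation."""
    # Flatten the sentences: span indices count tokens from the document start.
    tokens = [tok for sent in record["sentences"] for tok in sent]
    resolved = []
    for sentence_relations in record["relations"]:
        for start1, end1, start2, end2, label in sentence_relations:
            first = " ".join(tokens[start1:end1 + 1])    # end index is inclusive
            second = " ".join(tokens[start2:end2 + 1])
            resolved.append((first, second, label))
    return resolved


# One made-up line in the format shown above.
line = json.dumps({
    "clusters": [],
    "sentences": [["Fast", "growth", "comes", "at", "the", "cost", "of",
                   "low", "survival", "."]],
    "ner": [[[0, 1, "Generic"], [3, 6, "Generic"], [7, 8, "Generic"]]],
    "relations": [[[3, 6, 0, 1, "Tradeoff"], [3, 6, 7, 8, "Tradeoff"]]],
    "doc_key": "XXX",
})

record = json.loads(line)
resolved = resolve_relations(record)
for first, second, label in resolved:
    print(label, ":", first, "->", second)
```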

We also provide a script to convert data from the verbose format to SciIE format, as well as a script to convert BRAT annotations to the verbose format.

Statistics

Also see dataset_statistics.py under the scripts folder.

| | Train | Dev | Test | Total |
|-----------------------------|-------------|-------|-------|-------|
| # Unique documents | 1010 | 138 | 144 | 1292 |
| # Sentences | 1248 | 150 | 150 | 1548 |
| Avg. sent. length | 37.42 | 38.91 | 40.02 | 37.81 |
| % of sents ≥ 25 tokens | 82.21 % | 85.33 % | 83.33 % | 82.62 % |
| Relations: | | | | |
| - Trade-Off | 639 | 54 | 72 | 765 |
| - Not-a-Trade-Off | 2004 | 258 | 240 | 2502 |
| - Arg-Modifier | 1247 | 142 | 132 | 1521 |
| Triggers | 1292 | 155 | 153 | 1600 |
| Keyphrases | 3436 | 401 | 398 | 4235 |
| Keyphrases w/ multiple relations | 1600 | 188 | 163 | 1951 |
| Spans | 4728 | 556 | 551 | 5835 |
| Max relations/sent | 9 | 8 | 8 | |
| Max spans/sent | 9 | 8 | 8 | |
| Max triggers/sent | 2 | 2 | 2 | |
| Max args/trigger | 5 | 4 | 4 | |
| Unique spans | | | | 3643 |
| Unique triggers | | | | 41 |
| # single-word keyphrases | | | | 864 (20.4%) |
| Avg. tokens per keyphrase | | | | 3.46 |
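The sentence-level rows of this table can be recomputed from records in the SciERC-style format. The sketch below is not the repository's dataset_statistics.py; it is a minimal illustration, and the function name `sentence_stats`, its `long_threshold` parameter and the toy records are hypothetical.

```python
from typing import Dict, Iterable


def sentence_stats(records: Iterable[dict], long_threshold: int = 25) -> Dict[str, float]:
    """Count sentences, average their token length, and report the share of long ones."""
    lengths = [len(sent) for rec in records for sent in rec["sentences"]]
    if not lengths:
        return {"sentences": 0, "avg_length": 0.0, "pct_long": 0.0}
    n_long = sum(1 for n in lengths if n >= long_threshold)
    return {
        "sentences": len(lengths),
        "avg_length": sum(lengths) / len(lengths),
        "pct_long": 100.0 * n_long / len(lengths),
    }


# Toy records: one sentence of 30 tokens and one of 10 tokens.
records = [
    {"sentences": [["tok"] * 30]},
    {"sentences": [["tok"] * 10]},
]

stats = sentence_stats(records)
print(stats)
```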

If you use the FOBIE dataset or SORE code in your research, please consider citing the following papers:

@inproceedings{Kruiper2020_SORE,
  author =      "Kruiper, Ruben
                and Vincent, Julian F V
                and Chen-Burger, Jessica
                and Desmulliez, Marc P Y
                and Konstas, Ioannis",
  title =       "In Layman's Terms: Semi-Open Relation Extraction from Scientific Texts",
  year =        "2020",
  url =         "https://arxiv.org/pdf/2005.07751.pdf",
  arxivId =     "2005.07751"
}

@inproceedings{Kruiper2020_FOBIE,
  author =      "Kruiper, Ruben
                and Vincent, Julian F V
                and Chen-Burger, Jessica
                and Desmulliez, Marc P Y
                and Konstas, Ioannis",
  title =       "A Scientific Information Extraction Dataset for Nature Inspired Engineering",
  booktitle =   "Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020)",
  year =        "2020",
  keywords =    "Biomimetics,Relation Extraction,Scientific Information Extraction,Trade-Offs",
  pages =       "2078--2085",
  url =         "http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.255.pdf",
  arxivId =     "2005.07753"
}

The FOBIE dataset and the SORE code in this repository are licensed under a Creative Commons Attribution 4.0 License. <img src="https://mirrors.creativecommons.org/presskit/buttons/88x31/png/by-sa.png" width="134" height="47">
