Lefex
A tool for extraction of lexical features from text based on UIMA and MapReduce
lefex: A Tool for LExical FEature eXtraction
This project contains Hadoop jobs for extraction of features of words and texts. Currently, the following types of features can be extracted:
- CoNLL. Takes a set of HTML documents in the CSV format url<TAB>s3-path<TAB>html-document and outputs the dependency-parsed documents in the CoNLL format. See the de.uhh.lt.lefex.CoNLL.HadoopMain class.
- ExtractTermFeatureScores. Given a corpus in plain text format, extracts word counts (word<TAB>count), feature counts (feature<TAB>count), and word-feature counts (word<TAB>feature<TAB>count) and saves them into CSV files. This job is used for feature extraction in the JoSimText project: the output of this job can serve as the input for computing a distributional thesaurus. See the de.uhh.lt.lefex.ExtractTermFeatureScores.HadoopMain class.
- ExtractLexicalSampleFeatureScores. Given a lexical sample dataset for word sense disambiguation in CSV format, extracts features of the target word in context and adds them as an extra column. Currently, the system supports three types of target-word features: co-occurrences, dependency features, and trigrams. See the de.uhh.lt.lefex.ExtractLexicalSampleFeatures.HadoopMain class.
- SentenceSplitter. Takes a plain text corpus as input and outputs a file with exactly one sentence per line. See the de.uhh.lt.lefex.SentenceSplitter.HadoopMain class.
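To make the word<TAB>feature<TAB>count layout written by ExtractTermFeatureScores more concrete, here is a minimal sketch of how such lines could be read back into memory. It is not part of lefex: the class name WordFeatureCounts and the method parse are hypothetical, and the sketch assumes that each line has exactly three tab-separated fields and that counts are whole numbers.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not a lefex class: reads word<TAB>feature<TAB>count lines.
public class WordFeatureCounts {

    /** Parses tab-separated lines into a map: word -> (feature -> count). */
    public static Map<String, Map<String, Long>> parse(List<String> lines) {
        Map<String, Map<String, Long>> counts = new LinkedHashMap<>();
        for (String line : lines) {
            String[] fields = line.split("\t");
            if (fields.length != 3) {
                continue; // assumption: well-formed lines have exactly three fields
            }
            long count;
            try {
                count = Long.parseLong(fields[2].trim());
            } catch (NumberFormatException e) {
                continue; // assumption: counts are integers; skip anything else
            }
            // Sum counts if the same word-feature pair appears more than once.
            counts.computeIfAbsent(fields[0], w -> new LinkedHashMap<>())
                  .merge(fields[1], count, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample lines in the word<TAB>feature<TAB>count layout.
        List<String> sample = Arrays.asList(
            "python\tprogramming\t3",
            "python\tsnake\t1",
            "java\tprogramming\t5");
        System.out.println(parse(sample));
    }
}
```

The same approach would apply to the two-column word<TAB>count and feature<TAB>count files, with a flat map instead of a nested one.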
To build the project, you may need to install a JoBimText jar file that contains a custom (non-mavenified) dependency-collapsing UIMA annotator. To do so, use the following script.
