Trapiche
Multi-source biome classification from text and taxonomy
Trapiche is an open-source tool for biome classification in metagenomic studies. The primary interface is external text predictions: you supply pre-computed biome labels (from manual curation, an external LLM, or any other source) directly in the input, and Trapiche uses them as constraints to guide its taxonomy-based deep classifier. A built-in BERT classifier is available as a lightweight fallback when no external labels are provided.
Trapiche combines two complementary sources of information:
- Text-based (primary): pre-computed biome labels supplied via `ext_text_pred_project`/`ext_text_pred_sample`, or, as a fallback, the built-in BERT-based classifier operating on free-text project/sample descriptions.
- Taxonomy-based: a community embedding of taxonomic profiles is fed to a feed-forward model for deep biome lineage prediction.
By integrating both views, Trapiche improves accuracy and robustness in biome classification.

Install
Requirements
- Python 3.10+
- Linux/macOS recommended (CPU or CUDA GPU)
From source
- Clone this repository
- Install the package and dependencies
By default TensorFlow is optional. Choose the extra that matches your needs:
# Clone
git clone https://github.com/Finn-Lab/trapiche.git
cd trapiche
# Install without TensorFlow (default)
pip install .
# Install with CPU-only TensorFlow
pip install .[cpu]
# Install with GPU TensorFlow
pip install .[gpu]
Quick start (CLI)
The CLI expects NDJSON (one JSON object per line). Each object represents one sample.
Required/optional keys per sample:
Text predictions (primary — recommended)
- `ext_text_pred_project` (optional): list of biome labels for the project, from manual curation or an external LLM (e.g. `["root:Environmental:Aquatic:Marine"]`). If this key is present in any sample in the batch, the internal BERT classifier is skipped for the entire batch.
- `ext_text_pred_sample` (optional): list of biome labels for this specific sample. Used together with `ext_text_pred_project` when the sample-over-study heuristic is enabled.
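The batch-level skip rule can be expressed in one line (a minimal sketch, not Trapiche's actual code):

```python
# Sketch: the internal BERT classifier is skipped for the whole batch
# as soon as any sample carries external project labels.
batch = [
    {"ext_text_pred_project": ["root:Environmental:Aquatic:Marine"]},
    {"project_description_text": "soil metagenome"},
]
skip_internal_classifier = any("ext_text_pred_project" in s for s in batch)
print(skip_internal_classifier)  # True
```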
Text predictions (fallback — internal BERT classifier)
- `project_description_text` (optional): free text describing the sample/project. Used only when no external labels are present.
- `project_description_file_path` (optional): path to a text file with the description. Ignored when `project_description_text` is provided.
- `sample_description_text` (optional): additional text for the specific sample. Used when the sample-over-study heuristic is enabled.
Taxonomy predictions
- `sample_taxonomy_paths` (required for taxonomy predictions): list of file paths. Accepted formats: .tsv, .tsv.gz (non-recursive).
Optional identifiers and study-level input
- `project_id` (optional): identifier to group samples into a project/study.
- `sample_id` (optional): identifier of the sample within the study.
- `taxonomy_study_tsv` (optional): path to a study-level taxonomy summary TSV.
  - When provided, it is used instead of `sample_taxonomy_paths`.
  - Requires both `project_id` and `sample_id` to be present.
  - The TSV file is loaded once per unique path and cached for reuse.
  - Rows are looked up by `sample_id` when deriving per-sample taxonomy data.
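The load-once-and-cache lookup described for `taxonomy_study_tsv` can be sketched as follows (an illustration only, not Trapiche's implementation; the `sample_id` column name and wide-table layout are assumptions):

```python
import csv
import functools
import os
import tempfile

@functools.lru_cache(maxsize=None)          # load each TSV only once per unique path
def load_study_table(path):
    rows = {}
    with open(path, newline="") as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            rows[row["sample_id"]] = row     # index rows by sample_id
    return rows

def sample_taxonomy(path, sample_id):
    """Derive per-sample taxonomy data from the cached study-level table."""
    return load_study_table(path).get(sample_id)

# Demo with a tiny study-level TSV (taxa as columns, samples as rows)
tsv = os.path.join(tempfile.mkdtemp(), "study.tsv")
with open(tsv, "w") as fh:
    fh.write("sample_id\tBacteria;Proteobacteria\tBacteria;Firmicutes\n"
             "S1\t10\t3\n"
             "S2\t0\t7\n")
print(sample_taxonomy(tsv, "S1"))
```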
Example input using external labels (recommended):
{"ext_text_pred_project": ["root:Environmental:Aquatic:Marine"], "sample_taxonomy_paths": ["test/files/taxonomy_files/ERZ34590789/ERZ34590789_FASTA_diamond.tsv.gz"]}
{"ext_text_pred_project": ["root:Environmental:Terrestrial:Soil"], "ext_text_pred_sample": ["root:Environmental:Terrestrial:Soil:Agricultural"], "sample_taxonomy_paths": ["test/files/taxonomy_files/ERZ19590789_FASTA_diamond.tsv.gz"]}
Example input using the fallback internal classifier:
{"project_description_text":"Effect of different fertilization treatments on soil microbiome...", "sample_taxonomy_paths":["test/files/taxonomy_files/ERZ34590789/ERZ34590789_FASTA_diamond.tsv.gz","test/files/taxonomy_files/ERZ34590789/ERZ34590789_FASTA_mseq.tsv"]}
{"project_description_file_path":"test/files/text_files/PRJEB42572_project_description.txt","sample_taxonomy_paths":["test/files/taxonomy_files/ERZ19590789_FASTA_diamond.tsv.gz"]}
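NDJSON inputs like the examples above can be generated programmatically with `json.dumps`, one object per line (the taxonomy paths below are illustrative placeholders):

```python
import json
import os
import tempfile

records = [
    {"ext_text_pred_project": ["root:Environmental:Aquatic:Marine"],
     "sample_taxonomy_paths": ["sample1_diamond.tsv.gz"]},   # illustrative path
    {"project_description_text": "Soil microbiome under fertilization treatments",
     "sample_taxonomy_paths": ["sample2_diamond.tsv.gz"]},   # illustrative path
]

path = os.path.join(tempfile.mkdtemp(), "input.ndjson")
with open(path, "w") as fh:
    for rec in records:
        fh.write(json.dumps(rec) + "\n")     # one JSON object per line
```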
Run the workflow
# From file to default output path (<input>_trapiche_results.ndjson)
# By default the CLI writes a compact (minimal) result. To disable the
# minimal output and let the workflow params control which
# keys are saved, use the --disable-minimal-result flag.
trapiche input.ndjson
# To explicitly disable the minimal output and keep the full set controlled
# by TrapicheWorkflowParams:
trapiche input.ndjson --disable-minimal-result
# Or read from stdin and write to stdout
cat input.ndjson | trapiche -
# Disable a step
trapiche input.ndjson --no-run-text # no text-based constraints
# Enable/disable the sample-over-study heuristic for text predictions
trapiche input.ndjson --sample-study-text-heuristic
trapiche input.ndjson --no-sample-study-text-heuristic
Flags
- `--run-text`/`--no-run-text`, `--run-vectorise`/`--no-run-vectorise`, `--run-taxonomy`/`--no-run-taxonomy`
- `--keep-text-results` / `--keep-vectorise-results` / `--keep-taxonomy-results`
- `--disable-minimal-result` (default: false). By default the CLI produces the compact/minimal output (no flag required); when this flag is set, the minimal output is disabled and the final keys saved are controlled by `TrapicheWorkflowParams`.
- `--sample-study-text-heuristic` (or `--no-sample-study-text-heuristic`): when both project and sample text labels are present (either external or internal), run prediction on both and keep the union of labels.
Configuration via environment variables
Trapiche CLI and API use Pydantic Settings. You can override defaults with environment variables:
- `TRAPICHE_RUN_TEXT=true|false`
- `TRAPICHE_RUN_VECTORISE=true|false`
- `TRAPICHE_RUN_TAXONOMY=true|false`
- `TRAPICHE_SAMPLE_STUDY_TEXT_HEURISTIC=true|false`
Example:
export TRAPICHE_RUN_TEXT=false
export TRAPICHE_RUN_TAXONOMY=true
trapiche input.ndjson
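Conceptually, the Pydantic Settings override order works like the stand-in below (a sketch only, not Trapiche's actual settings class; the field names mirror the `TRAPICHE_*` variables, and the heuristic's default is an assumption):

```python
import os

class WorkflowSettings:
    """Minimal stand-in mimicking Pydantic Settings env parsing."""
    _fields = {"run_text": True, "run_vectorise": True,
               "run_taxonomy": True, "sample_study_text_heuristic": False}

    def __init__(self, **overrides):
        for name, default in self._fields.items():
            env = os.environ.get(f"TRAPICHE_{name.upper()}")
            if name in overrides:
                value = overrides[name]                      # explicit kwargs win
            elif env is not None:
                value = env.lower() in ("1", "true", "yes")  # env var next
            else:
                value = default                              # coded default last
            setattr(self, name, value)

os.environ["TRAPICHE_RUN_TEXT"] = "false"
settings = WorkflowSettings()
print(settings.run_text, settings.run_taxonomy)  # False True
```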
Quick start (Python API)
End-to-end workflow over sample records
The workflow takes a sequence of dicts (one dict per sample). The recommended approach is to supply external labels via `ext_text_pred_project`; the built-in classifier is used automatically as a fallback when those keys are absent.
Text predictions (primary — recommended)
- `ext_text_pred_project` (optional): list of biome labels for the project.
- `ext_text_pred_sample` (optional): list of biome labels for this specific sample. Used with the heuristic.
Text predictions (fallback — internal BERT classifier)
- `project_description_text` (optional): free text describing the sample/project.
- `project_description_file_path` (optional): path to a text file with the description.
- `sample_description_text` (optional): additional text for the specific sample (heuristic only).
Taxonomy predictions
- `sample_taxonomy_paths` (required for taxonomy predictions): list of file paths. Accepted formats: .tsv, .tsv.gz (non-recursive).
from trapiche.api import TrapicheWorkflowFromSequence
from trapiche.config import TrapicheWorkflowParams
# Recommended: supply external labels — no model download needed for the text step
samples = [
    {
        "ext_text_pred_project": ["root:Environmental:Aquatic:Marine"],
        "sample_taxonomy_paths": [
            "test/taxonomy_files/SRR1524511_MERGED_FASTQ_SSU_OTU.tsv",
            "test/taxonomy_files/SRR1524511_MERGED_FASTQ_LSU_OTU.tsv",
        ],
    }
]
workflow_params = TrapicheWorkflowParams(  # defaults shown
    run_text=True, run_vectorise=True, run_taxonomy=True,
    keep_text_results=True, keep_vectorise_results=False,
    keep_taxonomy_results=True, output_keys=None,
    # When output_keys is None, the keep_* flags decide what to include.
)
runner = TrapicheWorkflowFromSequence(workflow_params=workflow_params)
result = runner.run(samples) # sequence of dicts augmented with predictions
print(result)
runner.save("trapiche_results.ndjson") # optional convenience save
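The interaction between `output_keys` and the `keep_*` flags can be sketched like this (an illustrative filter; the result key names `text_results`, `vectorise_results`, and `taxonomy_results` are assumptions, not Trapiche's actual field names):

```python
def select_output(record, keep_text=True, keep_vectorise=False,
                  keep_taxonomy=True, output_keys=None):
    """Sketch of result filtering: an explicit output_keys list wins,
    otherwise the keep_* flags decide which result keys survive."""
    if output_keys is not None:
        return {k: record[k] for k in output_keys if k in record}
    drop = set()
    if not keep_text:
        drop.add("text_results")        # assumed key name
    if not keep_vectorise:
        drop.add("vectorise_results")   # assumed key name
    if not keep_taxonomy:
        drop.add("taxonomy_results")    # assumed key name
    return {k: v for k, v in record.items() if k not in drop}

record = {"sample_id": "S1",
          "text_results": ["root:Environmental"],
          "vectorise_results": [0.1, 0.2],
          "taxonomy_results": ["root:Environmental:Aquatic"]}
print(select_output(record))  # vectorise_results dropped by the defaults
```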
Fallback: internal text prediction from free text
from trapiche.api import TextToBiome
ttb = TextToBiome() # uses default model and device
# The `samples` above carry external labels, so supply free text directly here
texts = ["Effect of different fertilization treatments on soil microbiome..."]
text_predictions = ttb.predict(texts)
print(text_predictions) # list[list[str]]: predicted biome labels per input text
# Optionally save last predictions
ttb.save("text_preds.json")
Taxonomy → community vector → biome lineage
from trapiche.api import Community2vec, TaxonomyToBiome
# Vectorise one or more samples from taxonomy annotation files
c2v = Community2vec()
vectors = c2v.transform(samples)
tax2b = TaxonomyToBiome()
result = tax2b.predict(community_vectors=vectors, constrain=text_predictions)
print(len(result))
print(result[0]) # pandas DataFrame with per-sample predictions
# Optional saves
c2v.save("community_vectors.npy")
tax2b.save("taxonomy_predictions.csv")
tax2b.save_vectors("taxonomy_vectors.npy")
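The `constrain` argument embodies the text-over-taxonomy guidance described in the introduction. Conceptually it restricts which biome lineages the taxonomy model may output, for example by keeping only candidates under a text-predicted prefix (an illustrative sketch, not Trapiche's implementation; the fallback-to-unconstrained behaviour is an assumption):

```python
def constrain_lineages(scores, allowed_prefixes):
    """Keep only lineages extending one of the allowed prefixes,
    then return the highest-scoring survivor (or the global best if none match)."""
    masked = {lin: s for lin, s in scores.items()
              if any(lin == p or lin.startswith(p + ":") for p in allowed_prefixes)}
    pool = masked if masked else scores
    return max(pool, key=pool.get)

scores = {
    "root:Environmental:Aquatic:Marine": 0.4,
    "root:Environmental:Terrestrial:Soil": 0.5,
    "root:Host-associated:Human:Digestive system": 0.1,
}
print(constrain_lineages(scores, ["root:Environmental:Aquatic"]))
# -> root:Environmental:Aquatic:Marine
```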
Input schema
Input record (API and CLI workflow)
One JSON object per sample, provided as NDJSON lines (CLI) or as a Python list of dicts (API), with the following keys:
{
  "ext_text_pred_project": ["root:Environmental:Aquatic:Marine"],
  "ext_text_pred_sample": ["root:Environmental:Aquatic:Marine:Coastal"],
  // ^ primary text source: external labels (list of strings, each matching
  //   'root:Category[:Subcategory...]'). If present in any sample in the batch,
  //   the internal BERT classifier is skipped for the entire batch.
  "project_description_text": "Free text describing the sample.",
  "sample_description_text": "Free text describing this specific sample variant.",
  // ^ fallback: internal BERT classifier (used only when no ext_text_pred_* keys are present)
  "sample_taxonomy_paths": ["test/files/taxonomy_files/ERZ19590789_FASTA_diamond.tsv.gz"],
  // ^ required for taxonomy predictions (.tsv or .tsv.gz, non-recursive)
  "project_id": "PRJEB42572",
  "sample_id": "ERZ19590789"
  // ^ optional identifiers; both are required when taxonomy_study_tsv is used
}
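Labels can be sanity-checked against the 'root:Category[:Subcategory...]' pattern noted in the schema comments before running the workflow (a hypothetical helper; Trapiche may validate differently):

```python
import re

# Matches 'root' followed by one or more non-empty, colon-separated components.
LABEL_RE = re.compile(r"^root(:[^:]+)+$")

def is_valid_label(label):
    """Hypothetical check that a biome label follows the documented pattern."""
    return bool(LABEL_RE.match(label))

print(is_valid_label("root:Environmental:Aquatic:Marine"))  # True
print(is_valid_label("Environmental:Aquatic"))              # False
```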
