<p align="left"> <img src="assets/logo.svg" width="350" alt="Trapiche logo"> </p>

Trapiche — Multi-source biome classification from text and taxonomy

Trapiche is an open-source tool for biome classification in metagenomic studies. The primary interface is external text predictions: you supply pre-computed biome labels (from manual curation, an external LLM, or any other source) directly in the input, and Trapiche uses them as constraints to guide its taxonomy-based deep classifier. A built-in BERT classifier is available as a lightweight fallback when no external labels are provided.

Trapiche combines two complementary sources of information:

  • Text-based (primary): pre-computed biome labels supplied via ext_text_pred_project / ext_text_pred_sample, or — as a fallback — the built-in LLM-based classifier operating on free-text project/sample descriptions.
  • Taxonomy-based: a community-embedding of taxonomic profiles is fed to a feed-forward model for deep biome lineage prediction.

By integrating both views, Trapiche improves accuracy and robustness in biome classification.

Install

Requirements

  • Python 3.10+
  • Linux/macOS recommended (CPU or CUDA GPU)

From source

  1. Clone this repository
  2. Install the package and dependencies

TensorFlow is an optional dependency. Choose the extra that matches your needs:

# Clone
git clone https://github.com/Finn-Lab/trapiche.git
cd trapiche

# Install without TensorFlow (default)
pip install .

# Install with CPU-only TensorFlow (quotes avoid shell globbing of the brackets)
pip install ".[cpu]"

# Install with GPU TensorFlow
pip install ".[gpu]"

Quick start (CLI)

The CLI expects NDJSON (one JSON object per line). Each object represents one sample.

Required/optional keys per sample:

Text predictions (primary — recommended)

  • ext_text_pred_project (optional): list of biome labels for the project, from manual curation or an external LLM (e.g. ["root:Environmental:Aquatic:Marine"]). If this key is present in any sample in the batch, the internal BERT classifier is skipped for the entire batch.
  • ext_text_pred_sample (optional): list of biome labels for this specific sample. Used together with ext_text_pred_project when the sample-over-study heuristic is enabled.

Text predictions (fallback — internal BERT classifier)

  • project_description_text (optional): free text describing the sample/project. Used only when no external labels are present.
  • project_description_file_path (optional): path to a text file with the description. Ignored when project_description_text is provided.
  • sample_description_text (optional): additional text for the specific sample. Used when the sample-over-study heuristic is enabled.

Taxonomy predictions

  • sample_taxonomy_paths (required for taxonomy predictions): list of file paths.
    • Accepted formats: .tsv, .tsv.gz (non-recursive).

Optional identifiers and study-level input

  • project_id (optional): identifier to group samples into a project/study.
  • sample_id (optional): identifier of the sample within the study.
  • taxonomy_study_tsv (optional): path to a study-level taxonomy summary TSV.
    • When provided, this is used instead of sample_taxonomy_paths.
    • Requires both project_id and sample_id to be present.
    • The TSV file is loaded once per unique path and cached for reuse.
    • Rows are looked up by sample_id when deriving per-sample taxonomy data.
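For instance, a study-level record can be assembled like this (the identifiers and the TSV path are illustrative placeholders, not real files):

```python
import json

# Illustrative study-level record: taxonomy_study_tsv replaces
# sample_taxonomy_paths and requires both project_id and sample_id.
record = {
    "project_id": "PRJ_EXAMPLE",    # hypothetical project identifier
    "sample_id": "SAMPLE_001",      # hypothetical sample identifier
    "taxonomy_study_tsv": "path/to/study_summary.tsv",  # illustrative path
    "ext_text_pred_project": ["root:Environmental:Terrestrial:Soil"],
}
line = json.dumps(record)  # one NDJSON line, ready for the CLI
```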

Example input using external labels (recommended):

{"ext_text_pred_project": ["root:Environmental:Aquatic:Marine"], "sample_taxonomy_paths": ["test/files/taxonomy_files/ERZ34590789/ERZ34590789_FASTA_diamond.tsv.gz"]}
{"ext_text_pred_project": ["root:Environmental:Terrestrial:Soil"], "ext_text_pred_sample": ["root:Environmental:Terrestrial:Soil:Agricultural"], "sample_taxonomy_paths": ["test/files/taxonomy_files/ERZ19590789_FASTA_diamond.tsv.gz"]}
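Each external label is a lineage string of the form root:Category[:Subcategory...]. A quick sanity check for your own inputs, sketched here outside of Trapiche's API, might look like:

```python
import re

# Matches 'root' followed by one or more ':'-separated non-empty segments
LABEL_RE = re.compile(r"^root(:[^:]+)+$")

def is_valid_label(label: str) -> bool:
    """Return True if the label looks like a biome lineage string."""
    return bool(LABEL_RE.fullmatch(label))
```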

Example input using the fallback internal classifier:

{"project_description_text":"Effect of different fertilization treatments on soil microbiome...", "sample_taxonomy_paths":["test/files/taxonomy_files/ERZ34590789/ERZ34590789_FASTA_diamond.tsv.gz","test/files/taxonomy_files/ERZ34590789/ERZ34590789_FASTA_mseq.tsv"]}
{"project_description_file_path":"test/files/text_files/PRJEB42572_project_description.txt","sample_taxonomy_paths":["test/files/taxonomy_files/ERZ19590789_FASTA_diamond.tsv.gz"]}

Run the workflow

# From file to default output path (<input>_trapiche_results.ndjson)
# By default the CLI writes a compact (minimal) result. To disable the
# minimal output and let the workflow params control which
# keys are saved, use the --disable-minimal-result flag.
trapiche input.ndjson

# To explicitly disable the minimal output and keep the full set controlled
# by TrapicheWorkflowParams:
trapiche input.ndjson --disable-minimal-result

# Or read from stdin and write to stdout
cat input.ndjson | trapiche -

# Disable a step
trapiche input.ndjson --no-run-text  # no text-based constraints

# Enable/disable the sample-over-study heuristic for text predictions
trapiche input.ndjson --sample-study-text-heuristic
trapiche input.ndjson --no-sample-study-text-heuristic

Flags

  • --run-text/--no-run-text, --run-vectorise/--no-run-vectorise, --run-taxonomy/--no-run-taxonomy
  • --keep-text-results / --keep-vectorise-results / --keep-taxonomy-results
  • --disable-minimal-result (default: false): the CLI writes the compact/minimal output by default, with no flag required; when this flag is set, the minimal output is disabled and the final keys saved are controlled by TrapicheWorkflowParams.
  • --sample-study-text-heuristic (or --no-sample-study-text-heuristic): when both project/sample text labels are present (either external or internal), run prediction on both and keep the union of labels.
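The union step of the heuristic can be sketched as follows (variable names are illustrative, not from Trapiche's code):

```python
# Sample-over-study heuristic (sketch): predict on both the project- and
# sample-level text, then keep the union of the predicted labels.
project_labels = ["root:Environmental:Terrestrial:Soil"]
sample_labels = [
    "root:Environmental:Terrestrial:Soil:Agricultural",
    "root:Environmental:Terrestrial:Soil",
]

# Union while preserving first-seen order
combined = list(dict.fromkeys(project_labels + sample_labels))
```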

Configuration via environment variables

Trapiche CLI and API use Pydantic Settings. You can override defaults with environment variables:

  • TRAPICHE_RUN_TEXT=true|false
  • TRAPICHE_RUN_VECTORISE=true|false
  • TRAPICHE_RUN_TAXONOMY=true|false
  • TRAPICHE_SAMPLE_STUDY_TEXT_HEURISTIC=true|false

Example:

export TRAPICHE_RUN_TEXT=false
export TRAPICHE_RUN_TAXONOMY=true
trapiche input.ndjson
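A rough standalone sketch of the true|false convention these variables follow (this helper is illustrative only, not Trapiche's actual Pydantic Settings machinery):

```python
import os

def env_flag(name: str, default: bool) -> bool:
    """Read a TRAPICHE_* boolean override from the environment (sketch)."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}

os.environ["TRAPICHE_RUN_TEXT"] = "false"
run_text = env_flag("TRAPICHE_RUN_TEXT", True)
```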

Quick start (Python API)

End-to-end workflow over sample records

The workflow takes a sequence of dicts (one dict per sample). The recommended approach is to supply external labels via ext_text_pred_project; the built-in classifier is used automatically as a fallback when those keys are absent.

Text predictions (primary — recommended)

  • ext_text_pred_project (optional): list of biome labels for the project.
  • ext_text_pred_sample (optional): list of biome labels for this specific sample. Used with the heuristic.

Text predictions (fallback — internal BERT classifier)

  • project_description_text (optional): free text describing the sample/project.
  • project_description_file_path (optional): path to a text file with the description.
  • sample_description_text (optional): additional text for the specific sample (heuristic only).

Taxonomy predictions

  • sample_taxonomy_paths (required for taxonomy predictions): list of file paths.
    • Accepted formats: .tsv, .tsv.gz (non-recursive).

from trapiche.api import TrapicheWorkflowFromSequence
from trapiche.config import TrapicheWorkflowParams

# Recommended: supply external labels — no model download needed for the text step
samples = [
	{
		"ext_text_pred_project": ["root:Environmental:Aquatic:Marine"],
		"sample_taxonomy_paths": [
			"test/taxonomy_files/SRR1524511_MERGED_FASTQ_SSU_OTU.tsv",
			"test/taxonomy_files/SRR1524511_MERGED_FASTQ_LSU_OTU.tsv"
		]
	}
]

workflow_params = TrapicheWorkflowParams(  # defaults shown
	run_text=True, run_vectorise=True, run_taxonomy=True,
	keep_text_results=True, keep_vectorise_results=False, keep_taxonomy_results=True, output_keys=None
	# When output_keys is None, the keep_* flags decide what to include.
)

runner = TrapicheWorkflowFromSequence(workflow_params=workflow_params)
result = runner.run(samples)  # sequence of dicts augmented with predictions
print(result)
runner.save("trapiche_results.ndjson")  # optional convenience save
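If you prefer to handle serialization yourself rather than use runner.save, the augmented records can be written with the standard library, matching the CLI's one-object-per-line convention (a sketch; assumes the result is a list of dicts as described above):

```python
import json

def write_ndjson(records, path):
    """Write one JSON object per line (NDJSON), as the Trapiche CLI does."""
    with open(path, "w", encoding="utf-8") as fh:
        for rec in records:
            fh.write(json.dumps(rec) + "\n")
```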

Fallback: internal text prediction from free text

from trapiche.api import TextToBiome

ttb = TextToBiome()  # uses default model and device

# The fallback path reads free text from each record's
# "project_description_text" key; build the inputs explicitly here, since
# the samples above carry only external labels.
texts = ["Effect of different fertilization treatments on soil microbiome..."]
text_predictions = ttb.predict(texts)
print(text_predictions)  # list[list[str]]: predicted biome labels per input text

# Optionally save last predictions
ttb.save("text_preds.json")

Taxonomy → community vector → biome lineage

from trapiche.api import Community2vec, TaxonomyToBiome

# Vectorise one or more samples from taxonomy annotation files
c2v = Community2vec()

vectors = c2v.transform(samples)

tax2b = TaxonomyToBiome()
result = tax2b.predict(community_vectors=vectors, constrain=text_predictions)
print(len(result))
print(result[0])  # pandas DataFrame with per-sample predictions

# Optional saves
c2v.save("community_vectors.npy")
tax2b.save("taxonomy_predictions.csv")
tax2b.save_vectors("taxonomy_vectors.npy")

Input schema

Input record (API and CLI workflow)

One JSON object per sample, supplied as NDJSON (CLI) or as a list of dicts (API), with the following keys:

{
	"ext_text_pred_project": ["root:Environmental:Aquatic:Marine"],
	"ext_text_pred_sample":  ["root:Environmental:Aquatic:Marine:Coastal"],
	// ^ primary text source: external labels (list of strings, each matching
	//   'root:Category[:Subcategory...]'). If present in any sample in the batch,
	//   the internal BERT classifier is skipped for the entire batch.

	"project_description_text": "Free text describing the sample.",
	"sample_description_text": "Free text describing this specific sample variant.",
	// ^ fallback — internal BERT classifier (used only when ext_text_pred_*
	//   keys are absent). project_description_file_path may be given instead
	//   of project_description_text.

	"sample_taxonomy_paths": ["path/to/sample_taxonomy.tsv.gz"],
	// ^ required for taxonomy predictions (.tsv or .tsv.gz).

	"project_id": "PRJ_ID",
	"sample_id": "SAMPLE_ID",
	"taxonomy_study_tsv": "path/to/study_summary.tsv"
	// ^ optional study-level input; used instead of sample_taxonomy_paths and
	//   requires both project_id and sample_id.
}