SkillAgentSearch skills...

GenoPHI

Python package for phage-host interaction prediction

Install / Use

/learn @Noonanav/GenoPHI
About this skill

Quality Score

0/100

Supported Platforms

Universal

README

GenoPHI

(jee-no-fee)

Genotype-to-Phenotype Phage-Host Interaction Prediction

GenoPHI is a Python package for machine learning-based prediction of genotype-phenotype relationships using whole-genome sequence data. Originally designed for phage-host interaction prediction, GenoPHI supports both binary interaction prediction and regression tasks for any microbial phenotype. The package implements protein family-based and k-mer-based approaches to extract genomic features from amino acid sequences and predict phenotypes using CatBoost gradient boosting models.

PyPI version License: MIT Python 3.8+

Workflow Overview

GenoPHI Workflow Figure 1: GenoPHI workflow schematic showing the main analysis pipelines: Protein family-based workflow, K-mer-based workflow, and Predictive protein k-mer workflow. Each pathway includes feature extraction, selection, model training, and prediction steps.

Table of Contents

Features

Protein Family-Based Analysis

  1. MMSeqs2 Clustering: Cluster protein sequences into protein families based on sequence similarity
  2. Feature Table Generation: Create presence-absence matrices of protein families across genomes and consolidate into predictive features based on co-occurence across genomes
  3. Feature Selection: Identify predictive protein families (multiple available methods: RFE, SHAP, SHAP-RFE, ANOVA, Chi-squared, Lasso)
  4. Model Training: Train CatBoost models with hyperparameter optimization
  5. Phenotype Prediction: Predict interactions, resistance, or other phenotypes for new genomes
  6. Feature Annotation: Identify predictive protein sequences from predictive features

K-mer-Based Analysis

  1. K-mer Feature Extraction: Generate k-mer features from protein sequences with or without gene family context
  2. Predictive K-mer Workflow: Extract k-mers specifically from predictive protein families identified in protein family analysis
  3. Feature Selection & Modeling: Apply same robust feature selection and modeling pipelines
  4. Flexible K-mer Lengths: Support for single k-mer length or ranges (e.g., 3-6)

Application Modes

  • Phage-Host Interaction Prediction: Binary prediction of infection outcomes between phages and bacterial strains
  • Single-Strain Phenotype Prediction: Predict strain-level phenotypes (e.g., antibiotic resistance, growth rate) without requiring phage data
  • Regression Tasks: Predict continuous phenotypes (e.g., infection efficiency, metabolic rates)
  • General Feature-Based Modeling: Use any feature table with a phenotype column for custom applications

Advanced Capabilities

  • Dynamic Feature Weighting: Account for feature frequency distributions to handle imbalanced features
  • Clustering-Based Selection: Use HDBSCAN or hierarchical clustering for intelligent feature grouping
  • Multiple Feature Selection Methods: RFE, SHAP-RFE, SelectKBest, Chi-squared, Lasso, SHAP
  • Comprehensive Performance Metrics: AUC-ROC, Precision-Recall, MCC, F1-score, Accuracy
  • SHAP Interpretability: Feature importance analysis and visualization for model explainability
  • Bootstrapping Support: Robust model evaluation with multiple train-test splits

Installation

System Requirements

Minimum Requirements:

  • Python 3.8 or higher
  • 8 GB RAM
  • 4 CPU cores
  • 10 GB free disk space

Recommended for Large Datasets:

  • Python 3.10+
  • 32+ GB RAM
  • 8+ CPU cores
  • 50+ GB free disk space (depending on dataset size)

Tested Operating Systems:

  • Linux (Ubuntu 20.04+, CentOS 7+)
  • macOS (Sonoma 14+, Apple Silicon)

Virtual Environment (Recommended)

Create and activate a conda environment:

conda create -n genophi python=3.10
conda activate genophi

Install GenoPHI

From PyPI (Recommended):

pip install genophi

From GitHub (Development):

git clone https://github.com/Noonanav/GenoPHI.git
cd GenoPHI
pip install -e .

For development with optional dependencies:

pip install -e ".[dev]"

Install MMseqs2

External Dependency: GenoPHI requires MMseqs2 for protein sequence clustering and assignment.

Install via conda/mamba:

conda install -c bioconda mmseqs2
# or
mamba install -c bioconda mmseqs2

For other installation methods, see the MMSeqs2 Wiki.

Verify Installation

Test that GenoPHI is properly installed:

# Check GenoPHI version
genophi --version

# Verify MMseqs2 is accessible
mmseqs version

# Run basic help command
genophi --help

Typical Install Time

Full installation (conda environment + GenoPHI + MMseqs2) takes approximately 2-3 minutes on a standard desktop computer (tested on a MacBook Pro M2, 16 GB RAM, macOS Sonoma 14.3).

Demo

A small test dataset is included in the repository for demonstrating the software. To run the demo:

git clone https://github.com/Noonanav/GenoPHI.git
cd GenoPHI

genophi protein-family-workflow \
    --input_path_strain data/test_data/strain_AAs/ \
    --input_path_phage data/test_data/phage_AAs/ \
    --phenotype_matrix data/test_data/ecoli_test_interaction_matrix.csv \
    --output_dir demo_output/ \
    --threads 4 \
    --num_features 50 \
    --num_runs_fs 5 \
    --num_runs_modeling 10 \
    --method rfe \
    --filter_type strain

Test dataset: 25 E. coli strains and 25 phages with a binary interaction matrix.

Expected output: A demo_output/ directory containing MMseqs2 clustering results, feature selection outputs, trained models, performance metrics, and a workflow summary report. See Output Directory Structure for details.

Expected run time: ~25 minutes on a standard desktop computer (MacBook Pro M2, 16 GB RAM).

Quick Start

GenoPHI provides a unified command-line interface accessible through the genophi command:

# View available commands
genophi --help

# Get help for a specific command
genophi protein-family-workflow --help

Recommended Default Run

For most phage-host interaction prediction tasks, use these recommended settings:

genophi protein-family-workflow \
    --input_path_strain strain_fastas/ \
    --input_path_phage phage_fastas/ \
    --phenotype_matrix interactions.csv \
    --output_dir results/ \
    --threads 8 \
    --num_features 100 \
    --num_runs_fs 25 \
    --num_runs_modeling 50 \
    --method rfe \
    --use_clustering \
    --cluster_method hierarchical \
    --n_clusters 20 \
    --filter_type strain \
    --use_shap

Key Parameters Explained:

  • --num_features 100: Select top 100 features (adjust based on dataset size)
  • --num_runs_fs 25: 25 iterations for robust feature selection
  • --num_runs_modeling 50: 50 modeling runs for reliable performance estimates
  • --method rfe: Recursive Feature Elimination (balanced performance)
  • --use_clustering: Enable sample clustering-aware filtering
  • --filter_type strain: Critical for phage-host prediction - Ensures train/test splits separate by strain so the model learns to predict on new strains it hasn't seen before
  • --use_shap: Generate SHAP plots and feature importance analysis for model interpretability

Note: For phage-host interaction prediction, --filter_type strain is strongly recommended. This controls how train/test splits are made during feature selection and modeling, ensuring the model never sees the same strain in both training and testing. This forces the model to learn generalizable patterns rather than memorizing specific strain characteristics.

For single-strain phenotypes (no phage data):

genophi protein-family-workflow \
    --input_path_strain strain_fastas/ \
    --phenotype_matrix phenotypes.csv \
    --output_dir results/ \
    --threads 8 \
    --sample_column strain \
    --phenotype_column resistance \
    --filter_type none

Usage

CLI Commands Overview

GenoPHI provides the following main commands:

| Command | Description | |---------|-------------| | protein-family-workflow | Recommended basic workflow: Complete protein family-based workflow | | full-workflow | Protein families → k-mers from predictive proteins | | kmer-workflow | Complete k-mer-based workflow from all proteins | | cluster | Generate protein family clusters and feature tables | | select-features | Perform feature selection on any feature table | | train | Train predictive models on selected features | | predict | Predict phenotypes using trained models | | select-and-train | Feature selection + modeling from any feature table | | assign-features | Assign features to new genomes | | assign-predict | Assign features and predict (protein families) | | annotate | Annotate predictive features with functional info | | kmer-assign-features | Assign k-mer features to new genomes | | `kmer-assign-p

Related Skills

View on GitHub
GitHub Stars12
CategoryDevelopment
Updated4d ago
Forks6

Languages

Jupyter Notebook

Security Score

75/100

Audited on Apr 7, 2026

No findings