GenoPHI

Python package for phage-host interaction prediction

(jee-no-fee)

Genotype-to-Phenotype Phage-Host Interaction Prediction

GenoPHI is a Python package for machine learning-based prediction of genotype-phenotype relationships using whole-genome sequence data. Originally designed for phage-host interaction prediction, GenoPHI supports both binary interaction prediction and regression tasks for any microbial phenotype. The package implements protein family-based and k-mer-based approaches to extract genomic features from amino acid sequences and predict phenotypes using CatBoost gradient boosting models.

Workflow Overview

Figure 1: GenoPHI workflow schematic showing the main analysis pipelines: Protein family-based workflow, K-mer-based workflow, and Predictive protein k-mer workflow. Each pathway includes feature extraction, selection, model training, and prediction steps.

Table of Contents

- Features
- Installation
- Quick Start
- Usage
  - CLI Commands Overview
  - Workflows
- Input Data Formats
- Feature Selection Methods
- Performance Metrics
- Output Directory Structure
- Python API
- Advanced Usage
- Troubleshooting
- Testing
- Best Practices
- Citation
- Contributing
- License
- Support
- Acknowledgments

Features

Protein Family-Based Analysis

MMseqs2 Clustering: Cluster protein sequences into protein families based on sequence similarity
Feature Table Generation: Create presence-absence matrices of protein families across genomes, then consolidate families into predictive features based on their co-occurrence across genomes
Feature Selection: Identify predictive protein families (multiple available methods: RFE, SHAP, SHAP-RFE, ANOVA, Chi-squared, Lasso)
Model Training: Train CatBoost models with hyperparameter optimization
Phenotype Prediction: Predict interactions, resistance, or other phenotypes for new genomes
Feature Annotation: Identify predictive protein sequences from predictive features
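
The feature table step above can be illustrated with a minimal sketch. It assumes cluster assignments are available as (genome, protein family) pairs and treats families with an identical presence pattern across genomes as one consolidated feature. The function names and toy data are hypothetical, and GenoPHI's own consolidation logic may differ.

```python
from collections import defaultdict


def presence_absence(assignments):
    """Build a genome x family 0/1 matrix from (genome, family) pairs."""
    genomes = sorted({g for g, _ in assignments})
    families = sorted({f for _, f in assignments})
    present = set(assignments)
    matrix = {
        g: {f: int((g, f) in present) for f in families}
        for g in genomes
    }
    return genomes, families, matrix


def consolidate(genomes, families, matrix):
    """Group families that share an identical presence pattern across genomes."""
    groups = defaultdict(list)
    for f in families:
        pattern = tuple(matrix[g][f] for g in genomes)
        groups[pattern].append(f)
    return sorted(groups.values())


# Hypothetical cluster assignments: (genome, protein family)
assignments = [
    ("strainA", "fam1"), ("strainA", "fam2"),
    ("strainB", "fam1"), ("strainB", "fam2"), ("strainB", "fam3"),
    ("strainC", "fam3"),
]
genomes, families, matrix = presence_absence(assignments)
features = consolidate(genomes, families, matrix)
print(features)  # [['fam1', 'fam2'], ['fam3']]
```

Here fam1 and fam2 always occur together, so they collapse into a single feature, which reduces redundancy before feature selection.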

K-mer-Based Analysis

K-mer Feature Extraction: Generate k-mer features from protein sequences with or without gene family context
Predictive K-mer Workflow: Extract k-mers specifically from predictive protein families identified in protein family analysis
Feature Selection & Modeling: Apply same robust feature selection and modeling pipelines
Flexible K-mer Lengths: Support for single k-mer length or ranges (e.g., 3-6)
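
As a rough illustration of k-mer feature extraction, the sketch below collects all overlapping k-mers of an amino acid sequence for a single length or a range of lengths. The function name and example sequence are hypothetical, and this is not GenoPHI's implementation.

```python
def protein_kmers(sequence, k_min, k_max=None):
    """Return the set of k-mers of an amino acid sequence for k in [k_min, k_max]."""
    if k_max is None:
        k_max = k_min  # single k-mer length
    kmers = set()
    for k in range(k_min, k_max + 1):
        # Slide a window of width k over the sequence
        for i in range(len(sequence) - k + 1):
            kmers.add(sequence[i:i + k])
    return kmers


seq = "MKTAYIA"  # hypothetical short peptide
kmers_3 = protein_kmers(seq, 3)
kmers_3_to_6 = protein_kmers(seq, 3, 6)
print(sorted(kmers_3))    # ['AYI', 'KTA', 'MKT', 'TAY', 'YIA']
print(len(kmers_3_to_6))  # 14
```

The presence or absence of each k-mer in a genome can then serve as a feature in the same way as a protein family.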

Application Modes

Phage-Host Interaction Prediction: Binary prediction of infection outcomes between phages and bacterial strains
Single-Strain Phenotype Prediction: Predict strain-level phenotypes (e.g., antibiotic resistance, growth rate) without requiring phage data
Regression Tasks: Predict continuous phenotypes (e.g., infection efficiency, metabolic rates)
General Feature-Based Modeling: Use any feature table with a phenotype column for custom applications

Advanced Capabilities

Dynamic Feature Weighting: Account for feature frequency distributions to handle imbalanced features
Clustering-Based Selection: Use HDBSCAN or hierarchical clustering for intelligent feature grouping
Multiple Feature Selection Methods: RFE, SHAP-RFE, SelectKBest, Chi-squared, Lasso, SHAP
Comprehensive Performance Metrics: AUC-ROC, Precision-Recall, MCC, F1-score, Accuracy
SHAP Interpretability: Feature importance analysis and visualization for model explainability
Bootstrapping Support: Robust model evaluation with multiple train-test splits
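
To make the threshold-based metrics listed above concrete, here is a small sketch that computes accuracy, F1-score, and MCC from binary labels using their standard definitions. The function name and toy labels are hypothetical; GenoPHI's own metric code is not shown in this README.

```python
import math


def binary_metrics(y_true, y_pred):
    """Compute accuracy, F1-score, and MCC for binary (0/1) labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    f1_denom = 2 * tp + fp + fn
    f1 = 2 * tp / f1_denom if f1_denom else 0.0
    mcc_denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_denom if mcc_denom else 0.0
    return {"accuracy": accuracy, "f1": f1, "mcc": mcc}


# Hypothetical predictions for eight strain-phage pairs
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

MCC uses all four cells of the confusion matrix, which makes it informative when interaction matrices contain far more negatives than positives.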

Installation

System Requirements

Minimum Requirements:

Python 3.8 or higher
8 GB RAM
4 CPU cores
10 GB free disk space

Recommended for Large Datasets:

Python 3.10+
32+ GB RAM
8+ CPU cores
50+ GB free disk space (depending on dataset size)

Tested Operating Systems:

Linux (Ubuntu 20.04+, CentOS 7+)
macOS (Sonoma 14+, Apple Silicon)

Virtual Environment (Recommended)

Create and activate a conda environment:

conda create -n genophi python=3.10
conda activate genophi

Install GenoPHI

From PyPI (Recommended):

pip install genophi

From GitHub (Development):

git clone https://github.com/Noonanav/GenoPHI.git
cd GenoPHI
pip install -e .

For development with optional dependencies:

pip install -e ".[dev]"

Install MMseqs2

External Dependency: GenoPHI requires MMseqs2 for protein sequence clustering and assignment.

Install via conda/mamba:

conda install -c bioconda mmseqs2
# or
mamba install -c bioconda mmseqs2

For other installation methods, see the MMseqs2 wiki.


Verify Installation

Test that GenoPHI is properly installed:

# Check GenoPHI version
genophi --version

# Verify MMseqs2 is accessible
mmseqs version

# Run basic help command
genophi --help
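
The same check can be scripted. The sketch below only confirms that the `genophi` and `mmseqs` executables are discoverable on `PATH`; it does not validate versions. The helper name is hypothetical.

```python
import shutil


def missing_tools(tools=("genophi", "mmseqs")):
    """Return the names of required executables that are not on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("genophi and mmseqs are both on PATH")
```

If either tool is reported missing, confirm that the conda environment created above is activated.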

Typical Install Time

Full installation (conda environment + GenoPHI + MMseqs2) takes approximately 2-3 minutes on a standard desktop computer (tested on a MacBook Pro M2, 16 GB RAM, macOS Sonoma 14.3).

Demo

A small test dataset is included in the repository for demonstrating the software. To run the demo:

git clone https://github.com/Noonanav/GenoPHI.git
cd GenoPHI

genophi protein-family-workflow \
    --input_path_strain data/test_data/strain_AAs/ \
    --input_path_phage data/test_data/phage_AAs/ \
    --phenotype_matrix data/test_data/ecoli_test_interaction_matrix.csv \
    --output_dir demo_output/ \
    --threads 4 \
    --num_features 50 \
    --num_runs_fs 5 \
    --num_runs_modeling 10 \
    --method rfe \
    --filter_type strain

Test dataset: 25 E. coli strains and 25 phages with a binary interaction matrix.

Expected output: A demo_output/ directory containing MMseqs2 clustering results, feature selection outputs, trained models, performance metrics, and a workflow summary report. See Output Directory Structure for details.

Expected run time: ~25 minutes on a standard desktop computer (MacBook Pro M2, 16 GB RAM).

Quick Start

GenoPHI provides a unified command-line interface accessible through the genophi command:

# View available commands
genophi --help

# Get help for a specific command
genophi protein-family-workflow --help

Recommended Default Run

For most phage-host interaction prediction tasks, use these recommended settings:

genophi protein-family-workflow \
    --input_path_strain strain_fastas/ \
    --input_path_phage phage_fastas/ \
    --phenotype_matrix interactions.csv \
    --output_dir results/ \
    --threads 8 \
    --num_features 100 \
    --num_runs_fs 25 \
    --num_runs_modeling 50 \
    --method rfe \
    --use_clustering \
    --cluster_method hierarchical \
    --n_clusters 20 \
    --filter_type strain \
    --use_shap

Key Parameters Explained:

--num_features 100: Select top 100 features (adjust based on dataset size)
--num_runs_fs 25: 25 iterations for robust feature selection
--num_runs_modeling 50: 50 modeling runs for reliable performance estimates
--method rfe: Recursive Feature Elimination (balanced performance)
--use_clustering: Enable sample clustering-aware filtering
--filter_type strain: Critical for phage-host prediction. Separates train/test splits by strain, so the model is evaluated on strains it has not seen during training
--use_shap: Generate SHAP plots and feature importance analysis for model interpretability

Note: For phage-host interaction prediction, --filter_type strain is strongly recommended. This controls how train/test splits are made during feature selection and modeling, ensuring the model never sees the same strain in both training and testing. This forces the model to learn generalizable patterns rather than memorizing specific strain characteristics.
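
The idea behind a strain-level split can be sketched as follows. This is an illustration of group-aware splitting in general, assuming interaction data as (strain, phage, label) rows; the function name, split fraction, and toy data are hypothetical, and it is not the code behind `--filter_type`.

```python
import random


def split_by_strain(pairs, test_fraction=0.2, seed=0):
    """Split (strain, phage, label) rows so no strain appears in both sets."""
    strains = sorted({strain for strain, _, _ in pairs})
    rng = random.Random(seed)
    rng.shuffle(strains)
    # Hold out whole strains rather than individual strain-phage pairs
    n_test = max(1, round(len(strains) * test_fraction))
    held_out = set(strains[:n_test])
    train = [row for row in pairs if row[0] not in held_out]
    test = [row for row in pairs if row[0] in held_out]
    return train, test


# Hypothetical data: 5 strains x 3 phages with arbitrary labels
pairs = [(f"strain{i}", f"phage{j}", (i + j) % 2)
         for i in range(5) for j in range(3)]
train, test = split_by_strain(pairs)
train_strains = {row[0] for row in train}
test_strains = {row[0] for row in test}
print(len(train), len(test), train_strains.isdisjoint(test_strains))
```

A random split over individual pairs would instead let the same strain appear on both sides, which tends to inflate performance estimates.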

For single-strain phenotypes (no phage data):

genophi protein-family-workflow \
    --input_path_strain strain_fastas/ \
    --phenotype_matrix phenotypes.csv \
    --output_dir results/ \
    --threads 8 \
    --sample_column strain \
    --phenotype_column resistance \
    --filter_type none
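
When a run needs to be launched from a script, one option is to assemble the same command in Python and pass it to `subprocess`. The sketch below uses only flags shown in this README; the helper names are hypothetical, and it does not use the GenoPHI Python API.

```python
import shlex
import subprocess


def build_command(workflow, options, flags=()):
    """Assemble a genophi CLI call from value options and boolean flags."""
    cmd = ["genophi", workflow]
    for name, value in options.items():
        cmd += [f"--{name}", str(value)]
    cmd += [f"--{flag}" for flag in flags]
    return cmd


def run(cmd):
    """Execute the command and raise if it exits with a non-zero status."""
    subprocess.run(cmd, check=True)


cmd = build_command(
    "protein-family-workflow",
    {
        "input_path_strain": "strain_fastas/",
        "phenotype_matrix": "phenotypes.csv",
        "output_dir": "results/",
        "threads": 8,
        "sample_column": "strain",
        "phenotype_column": "resistance",
        "filter_type": "none",
    },
)
print(shlex.join(cmd))  # inspect the command before calling run(cmd)
```

Printing the command first makes it easy to compare against the shell form above before starting a long run.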

Usage

CLI Commands Overview

GenoPHI provides the following main commands:

| Command | Description |
|---------|-------------|
| protein-family-workflow | Recommended basic workflow: Complete protein family-based workflow |
| full-workflow | Protein families → k-mers from predictive proteins |
| kmer-workflow | Complete k-mer-based workflow from all proteins |
| cluster | Generate protein family clusters and feature tables |
| select-features | Perform feature selection on any feature table |
| train | Train predictive models on selected features |
| predict | Predict phenotypes using trained models |
| select-and-train | Feature selection + modeling from any feature table |
| assign-features | Assign features to new genomes |
| assign-predict | Assign features and predict (protein families) |
| annotate | Annotate predictive features with functional info |
| kmer-assign-features | Assign k-mer features to new genomes |
| `kmer-assign-p
