PhaMers
A bioinformatic tool for identifying bacteriophages using machine learning and k-mers
PhaMers (Phage k-mers)
This repository contains the implementation of "PhaMers", a bioinformatic tool for identifying bacteriophages (phages) in metagenomic sequencing data on the basis of their k-mer frequencies. PhaMers uses basic supervised machine learning techniques with k-mer frequency vectors as the feature representation. The PhaMers classification algorithm is trained on k-mer feature vectors computed from sequences in GenBank and RefSeq.
This repository also contains utilities to analyze and plot DNA sequences to facilitate understanding of metagenomic datasets.
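To make the feature representation concrete, here is a minimal sketch of turning a DNA sequence into a normalized k-mer frequency vector. It is not taken from this repository: the function name `kmer_frequencies`, the default `k=4`, the lexicographic ordering, and the choice to skip windows containing ambiguous bases are all assumptions made for illustration.

```python
from itertools import product


def kmer_frequencies(sequence, k=4):
    """Return the normalized k-mer frequency vector of a DNA sequence.

    The vector has 4**k entries, ordered lexicographically over A, C, G, T.
    Windows containing any other character (for example N) are skipped.
    Hypothetical helper; k=4 is an illustrative default, not PhaMers' setting.
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = [0] * len(kmers)
    sequence = sequence.upper()
    # Slide a window of length k along the sequence and count each k-mer.
    for start in range(len(sequence) - k + 1):
        window = sequence[start:start + k]
        if window in index:
            counts[index[window]] += 1
    total = sum(counts)
    # Return all zeros for sequences with no countable window.
    return [c / float(total) if total else 0.0 for c in counts]


freqs = kmer_frequencies("ACGTACGTAC", k=2)
```

Normalizing by the total count makes vectors from sequences of different lengths comparable, which matters when contigs in a metagenomic dataset vary widely in size.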
Installation
PhaMers requires Python 2.7 as well as some basic packages, which may be installed by running the following command:
pip install -r requirements.txt
Usage
To score DNA sequences using PhaMers, use the following command:
python scripts/phamer.py -in $input_dir -data data --debug --equalize_reference
where the variable $input_dir is a path to a directory that contains a collection
of FASTA files with sequences that will be scored. This will create a directory
called phamer_output containing the scores in a file called phamer_scores.csv.
Scores range from -1 to 1 with higher scores indicating that the sequence
is more phage-like.
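Once the scores file exists, the sequences can be ranked by score. The sketch below assumes that each row of phamer_scores.csv holds a sequence ID followed by a score and that a header row may be present; the actual column layout is not documented here, so check the file before relying on this. The helper names and the threshold of 0.0 are illustrative, not a recommended cutoff.

```python
import csv
import io


def read_scores(handle):
    """Parse (sequence_id, score) rows from an open CSV file.

    Assumes the ID is in the first column and the score in the second.
    Rows whose second field is not a number (such as a header) are skipped.
    """
    scores = {}
    for row in csv.reader(handle):
        if len(row) < 2:
            continue
        try:
            scores[row[0].strip()] = float(row[1])
        except ValueError:
            continue
    return scores


def phage_like(scores, threshold=0.0):
    """Return (sequence_id, score) pairs above threshold, highest score first."""
    hits = [(seq_id, s) for seq_id, s in scores.items() if s > threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Hypothetical contents standing in for phamer_output/phamer_scores.csv.
example = io.StringIO(u"id,score\ncontig_1,0.82\ncontig_2,-0.40\ncontig_3,0.15\n")
scores = read_scores(example)
hits = phage_like(scores)
```

For a real run, replace the in-memory example with `open("phamer_output/phamer_scores.csv")`.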
To run further analysis and generate plots, run the following command:
python scripts/analysis.py -in $input_dir -data data --debug
with the variable $input_dir as before.
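Both commands read FASTA files from $input_dir, so it can help to check the directory contents before a long run. The following is a rough sketch of such a check, independent of the repository's own file handling. The list of file extensions is an assumption; which extensions PhaMers accepts is not stated here.

```python
import os

# Assumed extensions for illustration only.
FASTA_EXTENSIONS = (".fasta", ".fa", ".fna")


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def summarize_directory(input_dir):
    """Map each FASTA file name in input_dir to its number of sequences."""
    summary = {}
    for name in sorted(os.listdir(input_dir)):
        if name.lower().endswith(FASTA_EXTENSIONS):
            with open(os.path.join(input_dir, name)) as handle:
                summary[name] = sum(1 for _ in read_fasta(handle))
    return summary


records = list(read_fasta([">contig_1 sample", "ACGT", "TTGA", ">contig_2", "GGCC"]))
```

Calling `summarize_directory(input_dir)` on the real input directory gives a quick count of sequences per file, which makes empty or misnamed files easy to spot.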
Scripts
This repository contains a number of scripts, briefly described below:
- phamer.py: The main PhaMers scoring functionality. This script reads files in FASTA format, counts k-mers, and scores sequences against reference datasets. It can also run t-SNE on the combined datasets.
- analysis.py: Integrates and presents data from PhaMers, VirSorter, and IMG. It produces t-SNE plots of metagenomic datasets, contig diagrams, performance plots, and text files that summarize results.
- feature_taxonomy.py: A class and functions that run t-SNE and cluster points to examine enrichment for taxa.
- cross_validate.py: A class and functions that help with N-fold cross validation.
- kmer.py: Functions for counting k-mers.
- cluster.py: Cluster optimization analysis.
- learning.py: Functions implementing several tools useful for machine learning, plus wrapper functions for SciPy ML functions.
- distinguishable_colors.py: Functions for generating a set of colors that are visually distinguishable from each other.
- fileIO.py: Functions for reading and writing the files that store PhaMers inputs and outputs.
- id_parser.py: Functions that parse headers of different formats into IDs.
- img_parser.py: Functions for parsing IMG output files.
- basic.py: Basic utility functions that might be useful in any program.
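As background for the cross validation script, N-fold cross validation splits a labeled dataset into N folds and holds out each fold once for testing while training on the rest. The sketch below shows only the index bookkeeping and is not the repository's implementation; `n_fold_indices` and its parameters are hypothetical names.

```python
import random


def n_fold_indices(n_items, n_folds, seed=0):
    """Yield (train, test) index lists for N-fold cross validation."""
    if not 2 <= n_folds <= n_items:
        raise ValueError("need 2 <= n_folds <= n_items")
    indices = list(range(n_items))
    # Shuffle with a fixed seed so the split is reproducible.
    random.Random(seed).shuffle(indices)
    # Striding puts every index in exactly one fold; fold sizes differ by at most one.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(n_fold_indices(n_items=10, n_folds=5))
```

Each `(train, test)` pair can then index into the k-mer feature vectors and labels, so every sequence is scored exactly once by a model that never saw it during training.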
