PhaMers

A bioinformatic tool for identifying bacteriophages using machine learning and k-mers

PhaMers (Phage k-mers)

This repository contains the implementation of "PhaMers", a bioinformatic tool for identifying bacteriophages (phages) from metagenomic sequencing data on the basis of their k-mer frequencies. PhaMers uses basic techniques from supervised machine learning with k-mer frequency vectors as the feature representation. The PhaMers classification algorithm is trained on k-mer feature vectors from GenBank and RefSeq.
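
To make the idea of a k-mer frequency vector concrete, here is a minimal sketch of one way to compute it. It is not taken from kmer.py: the function name `kmer_frequencies`, the default `k=4`, the lexicographic ordering of k-mers, and the choice to skip windows containing ambiguous bases are all assumptions made for illustration.

```python
from __future__ import division  # keeps the division below true division on Python 2.7

from itertools import product


def kmer_frequencies(sequence, k=4):
    """Return the normalized k-mer counts of a DNA sequence.

    The vector has 4**k entries, ordered lexicographically over A, C, G, T.
    (Hypothetical helper; the ordering and normalization are assumptions.)
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = [0] * len(kmers)

    sequence = sequence.upper()
    # Slide a window of width k along the sequence, one base at a time.
    for start in range(len(sequence) - k + 1):
        position = index.get(sequence[start:start + k])
        if position is not None:  # skip windows with ambiguous bases such as N
            counts[position] += 1

    total = sum(counts)
    if total == 0:
        return [0.0] * len(kmers)
    return [count / total for count in counts]
```

Calling `kmer_frequencies(contig, k=4)` on a contig string yields a 256-element vector that sums to 1, which is the kind of fixed-length feature a supervised classifier can consume regardless of sequence length.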

This repository also contains utilities to analyze and plot DNA sequences to facilitate understanding of metagenomic datasets.

Installation

PhaMers requires Python 2.7 as well as some basic packages, which may be installed by running the following command:

pip install -r requirements.txt

Usage

To score DNA sequences using PhaMers, use the following command:

python scripts/phamer.py -in $input_dir -data data --debug --equalize_reference

where the variable $input_dir is a path to a directory that contains a collection of FASTA files with sequences that will be scored. This will create a directory called phamer_output containing the scores in a file called phamer_scores.csv. Scores range from -1 to 1 with higher scores indicating that the sequence is more phage-like.
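
Once scores have been written, a short script can pick out the most phage-like sequences. The sketch below is an illustration only: the README does not document the column layout of phamer_scores.csv, so it assumes each row holds a sequence ID followed by a score, and it skips any row whose second field is not a number (for example a header). The helper names `load_scores` and `phage_like` and the cutoff of 0.0 are hypothetical.

```python
import csv


def load_scores(lines):
    """Parse rows of 'sequence_id,score' into a dict (assumed file layout)."""
    scores = {}
    for row in csv.reader(lines):
        if len(row) < 2:
            continue
        try:
            scores[row[0].strip()] = float(row[1])
        except ValueError:
            continue  # header row or other non-numeric score field
    return scores


def phage_like(scores, threshold=0.0):
    """Return IDs scoring above the threshold, most phage-like first."""
    selected = [seq_id for seq_id, score in scores.items() if score > threshold]
    return sorted(selected, key=lambda seq_id: -scores[seq_id])
```

In practice `load_scores` would be given an open file handle for phamer_scores.csv, since `csv.reader` accepts any iterable of lines. Because scores range from -1 to 1, a cutoff of 0.0 is only a starting point; a stricter threshold trades recall for fewer false positives.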

To run further analysis and generate plots, run the following command:

python scripts/analysis.py -in $input_dir -data data --debug

with the variable $input_dir as before.

Scripts

This repository contains several scripts, briefly described below:

phamer.py
- The main PhaMers scoring functionality is contained here. This script can take in files in FASTA format, count k-mers, and score files against reference datasets. This script can also do t-SNE on the combined datasets.
analysis.py
- This script integrates and presents data from PhaMers, VirSorter, and IMG. This script makes t-SNE plots of metagenomic datasets, contig diagrams, performance plots, and text files that summarize results.
feature_taxonomy.py
- A class and functions that do t-SNE and cluster points to examine enrichment for taxa.
cross_validate.py
- A class and functions that help to do N-fold cross-validation.
kmer.py
- Functions for counting k-mers.
cluster.py
- Cluster optimization analysis.
learning.py
- Functions that implement several tools useful for machine learning, plus wrapper functions for SciPy ML functions.
distinguishable_colors.py
- Some functions for getting a set of colors that can be visually distinguished from each other.
fileIO.py
- Some functions for getting data in and out of files that store inputs and outputs for PhaMers.
id_parser.py
- Functions that help parse headers of different formats to turn them into IDs.
img_parser.py
- Functions for parsing IMG output files.
basic.py
- Some basic utility functions that might be useful in any program.
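
As an illustration of the N-fold cross-validation mentioned for cross_validate.py, the sketch below shows one common way to partition sample indices into folds. It is not the repository's implementation: the generator name `n_fold_splits`, the round-robin fold assignment, and the fixed random seed are assumptions chosen to keep the example short and reproducible.

```python
import random


def n_fold_splits(n_items, n_folds=5, seed=0):
    """Yield (train_indices, test_indices) pairs for N-fold cross-validation.

    Each index appears in exactly one test fold. (Hypothetical helper.)
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # shuffle so folds are not ordered by input

    # Deal the shuffled indices round-robin into n_folds groups.
    folds = [indices[i::n_folds] for i in range(n_folds)]

    for held_out in range(n_folds):
        test = folds[held_out]
        train = [
            index
            for fold_number, fold in enumerate(folds)
            if fold_number != held_out
            for index in fold
        ]
        yield train, test
```

A classifier would be trained on the k-mer vectors selected by `train` and evaluated on those selected by `test`, repeating once per fold and averaging the resulting performance metric.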

Repository: jondeaton/PhaMers (Python)
