Seekr

A library for counting small kmer frequencies in nucleotide sequences.

SEEKR

Find communities of nucleotide sequences based on k-mer frequencies.

Installation

To use this library, you need Python 3.9.5 or Python 3.9.6 on your computer.

Install Python

Please follow the instructions on the Python official website for installing Python 3.9.5 or 3.9.6 on your machine:

Python3.9.5

Python3.9.6

Setup default Python

First, check which Python version is currently the default:

python3 --version

Expected output

Python 3.9.5

or

Python 3.9.6

Then, if you are on a Mac, make sure the default Python is the manually installed version and not the macOS system Python:

which python3

Expected output

/Library/Frameworks/Python.framework/Versions/3.9/bin/python3

If you see the expected output, skip the rest of this section and continue with the next step, Install CMAKE. If you see something different, most likely /usr/bin/python3, you need to adjust your PATH environment variable manually.

To adjust your PATH environment variable, first determine which shell your macOS version uses, then edit either the Zsh or the Bash configuration file, but not both.

For Zsh (Default shell on macOS Catalina and later):

nano ~/.zshrc

For Bash (For older macOS or if you switched back to Bash):

nano ~/.bash_profile

This opens the configuration file, or creates it if it does not exist yet. In the editor, paste the following line at the end of the file. It prepends the Python 3.9.5 or 3.9.6 bin directory to your PATH:

export PATH="/Library/Frameworks/Python.framework/Versions/3.9/bin:$PATH"

To save and exit, press Ctrl + X to exit nano, press Y to confirm saving the changes, then press Enter to write the file.

After exiting nano, reload your shell configuration by typing the matching command in the terminal.

If you edited ~/.zshrc:

source ~/.zshrc

If you edited ~/.bash_profile:

source ~/.bash_profile

Now verify the change

which python3

You should see the expected output now:

/Library/Frameworks/Python.framework/Versions/3.9/bin/python3
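The same checks can also be run from inside Python. This is a minimal sketch, not part of SEEKR: the helper names `describe_python` and `is_supported` are hypothetical, and the set of accepted versions simply mirrors the two versions named in this section.

```python
import shutil
import sys

# Versions named in the installation notes above.
SUPPORTED_VERSIONS = {(3, 9, 5), (3, 9, 6)}


def is_supported(version_info=sys.version_info):
    """Return True if the (major, minor, micro) triple is 3.9.5 or 3.9.6."""
    return tuple(version_info[:3]) in SUPPORTED_VERSIONS


def describe_python():
    """Collect the interpreter version, its path, and what `python3` resolves to."""
    return {
        "version": "{}.{}.{}".format(*sys.version_info[:3]),
        "executable": sys.executable,
        # First `python3` found on PATH, or None if there is none.
        "python3_on_path": shutil.which("python3"),
    }


if __name__ == "__main__":
    for key, value in describe_python().items():
        print(f"{key}: {value}")
    print("supported:", is_supported())
```

If `python3_on_path` points at /usr/bin/python3, the PATH adjustment described above has not taken effect in the current shell.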

Install CMAKE

If you don't already have the CMAKE package, it helps to install it before the SEEKR package.

pip install cmake

Install through Python Package Index (PyPI)

pip install seekr

This will make both the command line tool and the python module available.

Install through Github

pip install git+https://github.com/CalabreseLab/seekr.git

This will install seekr from the main branch of the GitHub repository and make both the command line tool and the Python module available.

Install through Docker Hub

First you need to install Docker on your local computer. Then pull the Docker Image:

docker pull calabreselab/seekr:latest

This pulls the Docker image, which lets you run seekr from the command line or from a Jupyter Notebook. See the Seekr Docker Image section below for more details.

CentOS

Users have been successful in installing seekr from source on CentOS:

conda create --name seekr_source python=3.9
conda activate seekr_source
git clone https://github.com/CalabreseLab/seekr.git
cd seekr
python3 setup.py install
conda install python-igraph
conda install louvain

See this issue for further discussion.

Usage

You can use SEEKR either from the command line or as a Python module. The package is broken up into a set of tools, each of which performs a single task. From the command line, all of the tools begin with seekr_. For example, you can use seekr_kmer_counts to generate a normalized k-mer count matrix of m rows by n columns, where m is the number of transcripts in a fasta file and n is 4^k, the number of possible k-mers. Then seekr_pearson can be used to calculate how well correlated all pairwise combinations of sequences are.
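To make the shape of that matrix concrete, here is a minimal pure-Python sketch of raw k-mer counting. It is an illustration only, not SEEKR's implementation: the function names `kmer_index` and `count_matrix` are hypothetical, and no length normalization or standardization is applied.

```python
from itertools import product


def kmer_index(k, alphabet="ACGT"):
    """Map every possible k-mer to a column index (4**k columns in total)."""
    return {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}


def count_matrix(sequences, k):
    """Return an m x 4**k list of lists holding raw k-mer counts."""
    index = kmer_index(k)
    matrix = []
    for seq in sequences:
        row = [0] * len(index)
        seq = seq.upper()
        # Slide a window of width k across the sequence.
        for start in range(len(seq) - k + 1):
            kmer = seq[start:start + k]
            if kmer in index:  # skip windows containing N or other symbols
                row[index[kmer]] += 1
        matrix.append(row)
    return matrix


sequences = ["ACGTACGT", "AAAAAC"]
counts = count_matrix(sequences, k=2)
print(len(counts), len(counts[0]))  # 2 16
```

With k=2 there are 4^2 = 16 columns, one per possible 2-mer, and one row per input sequence.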

To see all tools and some examples, run:

seekr

Quickstart

To get a .csv file of communities for every transcript in a small .fa file called example.fa (where RNAs have been normalized to a data set of canonical transcripts from GENCODE), we would run:

seekr_download_gencode lncRNA -g
seekr_filter_gencode v33_lncRNA.fa -gtf v33_lncRNA.chr_patch_hapl_scaff.annotation.gtf -len 500 -can -o v33 # Name may change with GENCODE updates.
seekr_norm_vectors v33_filtered.fa
seekr_kmer_counts example.fa -o 6mers.csv -mv mean.npy -sv std.npy
seekr_pearson 6mers.csv 6mers.csv -o example_vs_self.csv
cat example_vs_self.csv

This quickstart procedure produces a number of other files besides example_vs_self.csv. See below to learn about the other files produced along the way.
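Comparing a count matrix against itself, as the seekr_pearson step above does, conceptually means computing a Pearson correlation for every pair of rows. The sketch below shows that calculation in plain Python. It is not SEEKR's code; `pearson` and `pairwise_pearson` are hypothetical names, and the toy rows stand in for k-mer profiles.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return float("nan")  # correlation is undefined for a constant vector
    return cov / (norm_x * norm_y)


def pairwise_pearson(rows_a, rows_b):
    """Return a len(rows_a) x len(rows_b) matrix of correlations."""
    return [[pearson(a, b) for b in rows_b] for a in rows_a]


profiles = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 2.0, 1.0]]
similarity = pairwise_pearson(profiles, profiles)
for row in similarity:
    print([round(value, 3) for value in row])
```

The result is a square, symmetric matrix with ones on the diagonal when a matrix is compared with itself.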

Notes:

  • We'll use example.fa as a small sample set, if you want to download that file and follow along.
  • GENCODE is a high quality source for human and mouse lncRNA annotation. Fasta files can be found here.
    • In the examples below we'll generically refer to gencode.fa. Any sufficiently large fasta file can be used, as needed.

Here are some examples if you just want to get going.

Command line examples

seekr_download_gencode

Browsing GENCODE is nice if you want to explore fasta file options. But if you know what you want, you can just download it from the command line. This tool is also helpful on remote clusters.

To download all human transcripts of the latest release into a fasta file, run:

seekr_download_gencode all

GENCODE also stores mouse sequences. You can select mouse using the --species flag:

seekr_download_gencode all -s mouse

For consistency across experiments, you may want to stick to a particular release of GENCODE. To get lncRNAs from the M5 release of mouse, use --release:

seekr_download_gencode lncRNA -s mouse -r M5

If you do not want the script to automatically unzip the file, you can leave the fasta file gzipped with --zip:

seekr_download_gencode all -z

Finally, if you want to download the gtf file from the same species and release for later use in sequence filtering with seekr_filter_gencode, you can do so with --gtf:

seekr_download_gencode all -g

seekr_kmer_counts

Let's make a small .csv file of counts. We'll set a couple flags:

  • --kmer 2 so we only have 16 k-mers
  • --outfile out_counts.csv. This file will contain the log2-transformed z-scores of k-mer counts per kb.

seekr_kmer_counts example.fa -o out_counts.csv -k 2
cat out_counts.csv

You can also see the output of this command here.

Three options are available for log transformation, using the --log2 flag:

  • --log2 Log2.pre applies a log transformation to the length-normalized k-mer counts, with a +1 pseudo-count.
  • --log2 Log2.post applies a log transformation to the z-scores following count standardization (this is the default).
  • --log2 Log2.none applies no log transformation.
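As a rough illustration of the Log2.pre option, the sketch below length-normalizes raw counts to counts per kb and then takes log2 with a +1 pseudo-count. The function names are hypothetical, and dividing by the full sequence length is an assumption made for this example; SEEKR's exact normalization may differ. The Log2.post option is not sketched here because the text above does not say how negative z-scores are handled before the logarithm.

```python
from math import log2


def counts_per_kb(raw_counts, seq_length):
    """Scale raw k-mer counts to counts per 1000 nucleotides.

    Assumption: normalization divides by the sequence length in nucleotides.
    """
    return [count * 1000 / seq_length for count in raw_counts]


def log2_pre(per_kb):
    """Log-transform length-normalized counts with a +1 pseudo-count."""
    return [log2(value + 1) for value in per_kb]


raw = [2, 0, 1]                  # toy raw counts for three k-mers
per_kb = counts_per_kb(raw, 500) # a 500 nt transcript
print(per_kb)
print(log2_pre(per_kb))
```

The pseudo-count keeps k-mers with a count of zero at a transformed value of zero instead of an undefined logarithm.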

If we want to avoid normalization, we can produce k-mer counts per kb by setting the --log2 Log2.none, --uncentered and --unstandardized flags:

seekr_kmer_counts example.fa -o out_counts.csv -k 2 --log2 Log2.none -uc -us

Similarly, if we want a more compact, efficient numpy file, we can add the --binary and --remove_label flags:

seekr_kmer_counts example.fa -o out_counts.npy -k 2 --binary --remove_label

Note: This numpy file is binary, so you won't be able to view it directly.

What happens if we also remove the --kmer 2 option?

seekr_kmer_counts example.fa -o out_counts.npy
~/seekr/seekr/kmer_counts.py:143: RuntimeWarning: invalid value encountered in true_divide
  self.counts /= self.std

WARNING: You have `np.nan` values in your counts after standardization.
This is likely due to a *k*-mer not appearing in any of your sequences. Try:
1) using a smaller *k*-mer size,
2) beginning with a larger set of sequences,
3) passing precomputed normalization vectors from a larger data set (e.g. GENCODE).

The code runs, but we get a warning. That's because we're normalizing 4096 columns of k-mers. Most of those k-mers never appear in any of our 5 lncRNAs. This necessarily results in division by 0. If we use a much larger set of sequences, this same line works fine:
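The arithmetic behind that warning can be reproduced with a toy matrix. This sketch is an assumption-laden illustration, not SEEKR's code: `standardize_columns` is a hypothetical name, the population standard deviation is assumed, and `nan` is written explicitly where NumPy would produce it from a division by zero.

```python
from math import isnan, nan
from statistics import mean, pstdev


def standardize_columns(matrix):
    """Z-score each column; columns with zero standard deviation become nan."""
    columns = list(zip(*matrix))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) for col in columns]  # assumption: population std
    return [
        [(value - m) / s if s > 0 else nan for value, m, s in zip(row, means, stds)]
        for row in matrix
    ]


# Second column: a k-mer that never appears in either sequence.
toy_counts = [[1.0, 0.0], [3.0, 0.0]]
z_scores = standardize_columns(toy_counts)
print(z_scores)
print("columns with nan:", [j for j, v in enumerate(z_scores[0]) if isnan(v)])
```

Any column that is constant across all sequences, including a k-mer that is absent everywhere, has a standard deviation of zero and cannot be standardized.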

seekr_kmer_counts gencode.fa -o gencode_counts.npy

But what should you do if you're only interested in specific sequences?

seekr_norm_vectors

An effective way to find important k-mers in a small number of RNAs is to count their k-mers, but normalize their counts to mean and standard deviation vectors produced from a larger set of transcripts. We can produce these vectors once, then use them on multiple smaller sets of RNAs of interest.

Note: If --log2 Log2.post is passed in seekr_kmer_counts, then the --log2 Log2.post flag must be passed to seekr_norm_vectors. This is so that the log-transformed k-mer counts are standardized against reference k-mer counts that are also log transformed.

To produce the vectors, run:

seekr_norm_vectors gencode.fa

If you run ls, you should see mean.npy and std.npy in your directory.

To specify the path of these output files, use the --mean_vector and --std_vector flags:

seekr_norm_vectors gencode.fa -k 5 -mv mean_5mers.npy -sv std_5mers.npy
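The idea of standardizing a small set against vectors from a larger one can be sketched in a few lines. This is a toy illustration under stated assumptions, not SEEKR's implementation: `norm_vectors` and `standardize_with` are hypothetical names, the population standard deviation is assumed, and the small lists stand in for real count matrices.

```python
from statistics import mean, pstdev


def norm_vectors(reference_matrix):
    """Return per-column (mean_vector, std_vector) for a reference matrix."""
    columns = list(zip(*reference_matrix))
    return [mean(col) for col in columns], [pstdev(col) for col in columns]


def standardize_with(matrix, mean_vector, std_vector):
    """Z-score each row using precomputed reference vectors."""
    return [
        [(value - m) / s for value, m, s in zip(row, mean_vector, std_vector)]
        for row in matrix
    ]


# Toy "large" reference set with variation in every column.
reference = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]
# A single sequence of interest: it has no column variance of its own.
small_set = [[3.0, 0.0]]

mean_vector, std_vector = norm_vectors(reference)
print(standardize_with(small_set, mean_vector, std_vector))
```

Because the mean and standard deviation come from the reference set, even a single sequence can be standardized, provided every column varies in the reference.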

Now, we c
