Seekr
A library for counting small kmer frequencies in nucleotide sequences.
SEEKR
Find communities of nucleotide sequences based on k-mer frequencies.
Installation
To use this library, you need Python 3.9.5 or Python 3.9.6 on your computer.
Install Python
Please follow the instructions on the official Python website to install Python 3.9.5 or 3.9.6 on your machine.
Setup default Python
First, check which Python is currently the default:
python3 --version
Expected output
Python 3.9.5
or
Python 3.9.6
Then make sure the default Python is the manually installed version rather than the macOS system Python (for Mac users):
which python3
Expected output
/Library/Frameworks/Python.framework/Versions/3.9/bin/python3
If you see the expected output, you can skip the rest of this section and go on to the next step, Install CMAKE. If you see something different, most likely /usr/bin/python3, you need to adjust your PATH environment variable manually.
To adjust your PATH environment variable, first find out which macOS version you are using, then follow either the Zsh or the Bash instructions, but not both.
For Zsh (Default shell on macOS Catalina and later):
nano ~/.zshrc
For Bash (For older macOS or if you switched back to Bash):
nano ~/.bash_profile
This opens the configuration file, or creates it if it does not exist yet. In the editor, paste the following line at the end of the file. It prepends the Python 3.9.5 or 3.9.6 bin directory to your PATH:
export PATH="/Library/Frameworks/Python.framework/Versions/3.9/bin:$PATH"
Save and exit by: Press Ctrl + X to exit nano; Press Y to confirm saving the changes; Press Enter to write to the file.
After exiting nano, reload your shell configuration by typing the following command in the terminal.
If you edited the Zsh configuration:
source ~/.zshrc
If you edited the Bash configuration:
source ~/.bash_profile
Now verify the change
which python3
You should see the expected output now:
/Library/Frameworks/Python.framework/Versions/3.9/bin/python3
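The same check can be made from inside Python, which is handy when several interpreters are installed. The snippet below is a minimal sketch that uses only the standard library; the helper name `is_supported` is hypothetical and not part of seekr.

```python
import sys

# Versions named in the installation instructions above.
SUPPORTED_VERSIONS = {(3, 9, 5), (3, 9, 6)}


def is_supported(version_info=None):
    """Return True if the (major, minor, micro) triple is a supported version."""
    info = sys.version_info if version_info is None else version_info
    return tuple(info[:3]) in SUPPORTED_VERSIONS


# Report which interpreter is running and whether it matches.
print("Interpreter:", sys.executable)
print("Version:    ", ".".join(map(str, sys.version_info[:3])))
print("Supported:  ", is_supported())
```

Run it with `python3` so that it reports on the same interpreter your shell resolves by default.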
Install CMAKE
If you don't already have the CMAKE package, it helps to install it before the SEEKR package.
pip install cmake
Install through Python Package Index (PyPI)
pip install seekr
This will make both the command line tool and the python module available.
Install through Github
pip install git+https://github.com/CalabreseLab/seekr.git
This will install seekr from the main branch of the GitHub repository and make both the command line tool and the python module available.
Install through Docker Hub
First you need to install Docker on your local computer. Then pull the Docker Image:
docker pull calabreselab/seekr:latest
This will pull the Docker image, which enables running seekr from the command line or a Jupyter Notebook. See the Seekr Docker Image section below for more details.
CentOS
Users have been successful in installing seekr from source on CentOS:
conda create --name seekr_source python=3.9
conda activate seekr_source
git clone https://github.com/CalabreseLab/seekr.git
cd seekr
python3 setup.py install
conda install python-igraph
conda install louvain
See this issue for further discussion.
Usage
You can either use SEEKR from the command line or as a python module.
The package is broken up into a set of tools, each of which performs a single task.
From the command line, all of the functions will begin with seekr_.
For example, you can use seekr_kmer_counts to generate a normalized k-mer count matrix of m rows by n columns,
where m is the number of transcripts in a fasta file and n is 4^k, the number of possible k-mers of length k.
Then seekr_pearson can be used to calculate how well correlated all pairwise combinations of sequences are.
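To make the shape of that matrix concrete, here is a minimal pure-Python sketch of counting overlapping k-mers in one sequence and scaling them to counts per kb. It is an illustration only, not seekr's implementation; the function name `kmer_counts_per_kb` and the choice to skip windows containing non-ACGT characters are assumptions made for this example.

```python
from itertools import product


def kmer_counts_per_kb(sequence, k):
    """Count overlapping k-mers in one sequence, scaled to counts per kilobase."""
    # All 4**k possible k-mers, in a fixed order: one column per k-mer.
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    row = [0.0] * len(kmers)
    sequence = sequence.upper()
    for start in range(len(sequence) - k + 1):
        kmer = sequence[start:start + k]
        if kmer in index:  # ASSUMPTION: windows with N or other symbols are skipped
            row[index[kmer]] += 1
    scale = 1000 / len(sequence)
    return kmers, [count * scale for count in row]


kmers, row = kmer_counts_per_kb("ACGTACGTAC", k=2)
print(len(kmers))               # number of columns for k=2
print(row[kmers.index("AC")])   # "AC" count scaled to a 1 kb sequence
```

Each transcript contributes one such row, so m transcripts give an m by 4^k matrix.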
To see all tools and some examples, run:
seekr
Quickstart
To get a .csv file of communities for every transcript in a small .fa file called example.fa
(where RNAs have been normalized to a data set of canonical transcripts from GENCODE),
we would run:
seekr_download_gencode lncRNA -g
seekr_filter_gencode v33_lncRNA.fa -gtf v33_lncRNA.chr_patch_hapl_scaff.annotation.gtf -len 500 -can -o v33 # Name may change with GENCODE updates.
seekr_norm_vectors v33_filtered.fa
seekr_kmer_counts example.fa -o 6mers.csv -mv mean.npy -sv std.npy
seekr_pearson 6mers.csv 6mers.csv -o example_vs_self.csv
cat example_vs_self.csv
This quickstart procedure produces a number of other files besides example_vs_self.csv.
See below to learn about the other files produced along the way.
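The last step of the quickstart compares every k-mer profile with every other one. As a rough illustration of what a pairwise correlation matrix contains, the sketch below computes the textbook Pearson coefficient between rows in pure Python. It is not necessarily the exact computation performed by seekr_pearson, and the function names and toy profiles are made up for this example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return numerator / denominator


def pairwise_pearson(rows_a, rows_b):
    """Correlate every row of rows_a with every row of rows_b."""
    return [[pearson(a, b) for b in rows_b] for a in rows_a]


# Toy profiles: three "sequences" described by four k-mer values each.
profiles = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],  # same shape as the first row
    [4.0, 3.0, 2.0, 1.0],  # reversed shape
]
matrix = pairwise_pearson(profiles, profiles)
for line in matrix:
    print([round(r, 3) for r in line])
```

Comparing a file with itself, as in the quickstart, gives a square matrix with values near 1 on the diagonal.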
Notes:
- We'll use example.fa as a small sample set, if you want to download that file and follow along.
- GENCODE is a high quality source for human and mouse lncRNA annotation. Fasta files can be found here.
- In the examples below we'll generically refer to gencode.fa. Any sufficiently large fasta file can be used, as needed.
Here are some examples if you just want to get going.
Command line examples
seekr_download_gencode
Browsing GENCODE is nice if you want to explore fasta file options. But if you know what you want, you can just download it from the command line. This tool is also helpful on remote clusters.
To download all human transcripts of the latest release into a fasta file, run:
seekr_download_gencode all
GENCODE also stores mouse sequences. You can select mouse using the --species flag:
seekr_download_gencode all -s mouse
For consistency across experiments, you may want to stick to a particular release of GENCODE. To get lncRNAs from the M5 release of mouse, use --release:
seekr_download_gencode lncRNA -s mouse -r M5
If you do not want the script to automatically unzip the file, you can leave the fasta file gzipped with --zip:
seekr_download_gencode all -z
Finally, if you want to download the gtf file from the same species and release for later use in sequence filtering with seekr_filter_gencode, you can do so with --gtf:
seekr_download_gencode all -g
seekr_kmer_counts
Let's make a small .csv file of counts.
We'll set a couple of flags:
- --kmer 2 so we only have 16 k-mers
- --outfile out_counts.csv. This file will contain the log2-transformed z-scores of k-mer counts per kb.
seekr_kmer_counts example.fa -o out_counts.csv -k 2
cat out_counts.csv
You can also see the output of this command here.
Three options are available for log transformation, using the --log2 flag.
Pass --log2 Log2.pre for log transformation of length normalized k-mer counts, with a +1 pseudo-count,
pass --log2 Log2.post for log transformation of z-scores following count standardization (this is the default),
and pass --log2 Log2.none for no log transformation.
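The difference between the three modes is mainly the order of the log transformation and the standardization. The sketch below illustrates that ordering on a toy matrix of counts per kb. It is not seekr's code: the function names are hypothetical, and the shift applied before the log in the "post" branch is an assumption about how negative z-scores could be handled.

```python
from math import log2
from statistics import mean, pstdev


def standardize(matrix):
    """Z-score each column (k-mer) across rows (sequences)."""
    columns = list(zip(*matrix))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in matrix]


def transform(per_kb, mode):
    """Apply one of three illustrative log2 modes to counts per kb."""
    if mode == "pre":
        # Log first (with a +1 pseudo-count), then standardize.
        logged = [[log2(v + 1) for v in row] for row in per_kb]
        return standardize(logged)
    z_scores = standardize(per_kb)
    if mode == "post":
        # ASSUMPTION: shift z-scores so the smallest becomes 1 before the log.
        shift = abs(min(min(row) for row in z_scores)) + 1
        return [[log2(v + shift) for v in row] for row in z_scores]
    if mode == "none":
        return z_scores
    raise ValueError(f"unknown mode: {mode}")


# Toy input: 3 sequences x 2 k-mers, already scaled to counts per kb.
per_kb = [[0.0, 10.0], [4.0, 30.0], [8.0, 20.0]]
for mode in ("pre", "post", "none"):
    print(mode, [[round(v, 3) for v in row] for row in transform(per_kb, mode)])
```

With "pre" and "none" every column ends up centered on zero; with "post" the values are non-negative because the log is taken last.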
If we want to avoid normalization, we can produce k-mer counts per kb by setting the --log2 Log2.none, --uncentered and --unstandardized flags:
seekr_kmer_counts example.fa -o out_counts.csv -k 2 --log2 Log2.none -uc -us
Similarly, if we want a more compact, efficient numpy file,
we can add the --binary and --remove_label flags:
seekr_kmer_counts example.fa -o out_counts.npy -k 2 --binary --remove_label
Note: This numpy file is binary, so you won't be able to view it directly.
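Although the file cannot be read with cat, it can be inspected from Python with `numpy.load`. The sketch below assumes that a file written with --binary and --remove_label holds a plain 2-D array; the helper name `summarize_counts` is hypothetical, and a temporary array of zeros stands in for out_counts.npy so the example runs on its own.

```python
import os
import tempfile

import numpy as np


def summarize_counts(path):
    """Load a .npy count matrix and return its shape and dtype."""
    counts = np.load(path)
    return counts.shape, counts.dtype


# Stand-in for out_counts.npy: 5 sequences x 16 2-mers of placeholder zeros.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_counts.npy")
    np.save(demo_path, np.zeros((5, 16)))
    demo_shape, demo_dtype = summarize_counts(demo_path)

print(demo_shape, demo_dtype)
```

To inspect a real result, call `summarize_counts("out_counts.npy")` after running the command above.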
What happens if we also remove the --kmer 2 option?
seekr_kmer_counts example.fa -o out_counts.npy
~/seekr/seekr/kmer_counts.py:143: RuntimeWarning: invalid value encountered in true_divide
self.counts /= self.std
WARNING: You have `np.nan` values in your counts after standardization.
This is likely due to a *k*-mer not appearing in any of your sequences. Try:
1) using a smaller *k*-mer size,
2) beginning with a larger set of sequences,
3) passing precomputed normalization vectors from a larger data set (e.g. GENCODE).
The code runs, but we get a warning. That's because, without --kmer 2, we're normalizing 4096 (4^6) columns of k-mers. Most of those k-mers never appear in any of our 5 lncRNAs, so their columns have a standard deviation of 0, which necessarily results in division by 0. If we use a much larger set of sequences, this same line works fine:
seekr_kmer_counts gencode.fa -o gencode_counts.npy
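The source of the warning can be reproduced on a toy matrix: a k-mer that never appears gives a column of identical values, so its standard deviation is 0 and the z-score is undefined. The sketch below simply finds such columns; it is an illustration in pure Python, and `zero_variance_columns` is a hypothetical helper, not a seekr function.

```python
from statistics import pstdev


def zero_variance_columns(matrix):
    """Return indices of columns whose values are identical in every row."""
    return [i for i, col in enumerate(zip(*matrix)) if pstdev(col) == 0]


# Number of possible 6-mers, matching the 4096 columns mentioned above.
print(4 ** 6)

# Toy counts: 3 sequences x 4 k-mers; the third k-mer never appears.
counts = [
    [2.0, 1.0, 0.0, 5.0],
    [0.0, 3.0, 0.0, 4.0],
    [1.0, 1.0, 0.0, 6.0],
]
problem_columns = zero_variance_columns(counts)
print(problem_columns)
```

With only a handful of sequences and thousands of columns, many columns fall into this category, which is why a larger reference set avoids the warning.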
But what should you do if you're only interested in specific sequences?
seekr_norm_vectors
An effective way to find important k-mers in a small number of RNAs is to count their k-mers, but normalize their counts to mean and standard deviation vectors produced from a larger set of transcripts. We can produce these vectors once, then use them on multiple smaller sets of RNAs of interest. To produce the vectors, run:
seekr_norm_vectors gencode.fa
Note: If --log2 Log2.post is passed in seekr_kmer_counts, then the --log2 Log2.post flag must be passed to seekr_norm_vectors.
This is so that the log-transformed k-mer counts are standardized against reference k-mer counts that are also log transformed.
If you run ls, you should see mean.npy and std.npy in your directory.
To specify the path of these output files,
use the --mean_vector and --std_vector flags:
seekr_norm_vectors gencode.fa -k 5 -mv mean_5mers.npy -sv std_5mers.npy
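The idea behind reference normalization can be sketched in a few lines: compute one mean and one standard deviation per k-mer from the large set, then apply them to the small set. The example below is a pure-Python illustration on toy numbers, not seekr's implementation; `norm_vectors` and `standardize_with` are hypothetical names.

```python
from statistics import mean, pstdev


def norm_vectors(reference):
    """Per-column mean and standard deviation of a reference count matrix."""
    columns = list(zip(*reference))
    return [mean(col) for col in columns], [pstdev(col) for col in columns]


def standardize_with(rows, mean_vector, std_vector):
    """Z-score rows using precomputed reference vectors."""
    return [
        [(v - m) / s for v, m, s in zip(row, mean_vector, std_vector)]
        for row in rows
    ]


# Toy reference set: 4 sequences x 2 k-mers (counts per kb).
reference = [[0.0, 2.0], [2.0, 4.0], [4.0, 6.0], [2.0, 4.0]]
mean_vector, std_vector = norm_vectors(reference)

# Small set of interest: the second k-mer is absent here, but its
# reference standard deviation is non-zero, so no division by 0 occurs.
small_set = [[2.0, 0.0]]
z_scores = standardize_with(small_set, mean_vector, std_vector)
print(z_scores)
```

Because the vectors come from the reference, a k-mer missing from the small set still gets a finite, strongly negative z-score.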
Now, we c
