Isanreg

An Interpretable Self-Attention Network with block-attention and attention-attribution.

Generate Convert Improve

Install / Use

/learn @anilprakash94/Isanreg

About this skill

Quality Score

0/100

README

ISANREG

ISANREG is an Interpretable Self-Attention Network that uses block-attention and attention-attribution to learn REGulatory features.

Please cite the article "An interpretable block-attention network for identifying regulatory feature interactions, Briefings in Bioinformatics, 2023;, bbad250, https://doi.org/10.1093/bib/bbad250"

Dependencies

Python version = 3.8.8
OS = Ubuntu 20.04.4

Python Libraries

tensorflow (2.7.0) (Tensorflow dependencies recommended for the specific version is required for GPU support)
numpy (1.23.1)
matplotlib (3.4.2)
seaborn (0.11.1)
pandas (1.3.1)
pyfaidx (0.6.2)
scipy (1.8.1)

Other dependencies

FIMO (Find Individual Motif Occurrences) from MEME Suite (meme-5.4.1)
Circos for plotting (0.69-9)

Required Files

Human reference genome (hg38.fa)
Encode Exclusion files ("GRCh38_unified_blacklist.bed","dukeExcludeRegions.bed")
TF Chip-seq processed narrowpeak bed file for training input data generation
TF binding motif in Meme format from JASPAR
Meme file of all Homo sapiens specific TF motifs from CisBP ("Homo_sapiens.meme")

Scripts

ISANREG can be run on simulated inputs and in-vitro dataset derived inputs.

Simulated data

The scripts for running the model on simulated data are:

simulated_input.py

--"Homo_sapiens.meme" file needed as input and outputs simulated training, validation and testing files.

simulated_train.py

--Trains the model on simulated training data.

simulated_test.py

--Testing the model after training. The high affinity motifs of both the TFs are given as input. The testing input "simulated_test2k.txt" and weights files are provided in the repository which can be directly used for testing the model.

simulated_distplot.py

--Generates Swarm plot of distance between the embedded motifs. The attention-attribution output file generated after testing on simulated sequences is given as input. The file "simtest_out.txt" is provided in the repository which can be directly used for plotting.

In-vitro analysis

The scripts for running the model on in-vitro datasets are:

isanreg_dataprocess.py

--Creates training, validation and testing data for in-vitro datasets. Human reference genome (hg38.fa), Encode Exclusion files, TF Chip-seq processed narrowpeak bed file and TF binding motif in Meme format from JASPAR are given as input

isanreg_train.py

--Trains the model on TF specific training data.

isanreg_test.py

--Testing the model after training. Length of the TF motif is given as "--core_len" argument. Calculates the enrichment and significance of motifs from the attention-attribution data which helps in identifying interacting TFs. The input file for testing and weights file of the trained model for ESR1 is provided with the repository which can be used for testing.

isanreg_circos.py

--Creates input files needed for plotting enriched motifs of individual TFs according to circos requirements. The file "ESR1_out.txt" is provided in the repository which can be directly used for generating inputs for circos
--Circos .conf files have to be manually created for plotting.

isanreg_allplot.py

--Creates input files needed for plotting top enriched motifs of all the TFs according to circos requirements. 
--Circos .conf files and additional highlight file for validated interacting TFs have to be manually created for plotting.

Usage

Running the model

git clone https://github.com/anilprakash94/isanreg.git isanreg

cd isanreg

Then, run the programs according to the requirements and instructions listed in README.md.

For example:

python3 isanreg_dataprocess.py -h

usage: isanreg_dataprocess.py [-h] [--data_dir DATA_DIR] [-r REF_FASTA]
                              [--excl_files EXCL_FILES] [--seq_len SEQ_LEN]
                              [-i INPUT_FILE] [--tf_name TF_NAME]
                              [--tf_motif TF_MOTIF]

ISANREG Data processing

optional arguments:
  -h, --help            show this help message and exit
  --data_dir DATA_DIR   name of folder having data files
  -r REF_FASTA, --ref_fasta REF_FASTA
                        specify the reference genome fasta file
  --excl_files EXCL_FILES
                        path of bed files having exclusion regions
  --seq_len SEQ_LEN     total flanking sequence length for each input
  -i INPUT_FILE, --input_file INPUT_FILE
                        path of input file having chip_seq peak regions
  --tf_name TF_NAME     specify the name of the TF under study
  --tf_motif TF_MOTIF   specify the TF motif meme file

Related Skills

node-connect

349.7k

Diagnose OpenClaw node connection and pairing failures for Android, iOS, and macOS companion apps

frontend-design

109.7k

Create distinctive, production-grade frontend interfaces with high design quality. Use this skill when the user asks to build web components, pages, or applications. Generates creative, polished code that avoids generic AI aesthetics.

openai-whisper-api

349.7k

Transcribe audio via OpenAI Audio Transcriptions API (Whisper).

qqbot-media

349.7k

QQBot 富媒体收发能力。使用 <qqmedia> 标签，系统根据文件扩展名自动识别类型（图片/语音/视频/文件）。