UnsupervisedPeakCaller
Peak calling for ATAC-seq data using contrastive learning on biological replicates
The Replicative Contrastive Learner (RCL), an Unsupervised Contrastive Peak Caller
This page describes the unsupervised contrastive peak caller RCL (Replicative Contrastive Learner). The accompanying publication is available at DOI 10.1101/gr.277677.123.
Table of Contents
- [Prerequisites](#prerequisites)
- [Quickstart](#quickstart)
- [Installation](#installation)
- [Tutorial](#tutorial)
- [Input](#input)
- [Preprocessing](#preprocessing)
Prerequisites <a name = "prerequisites" />
For input preprocessing steps, the following tools and R libraries are required:
- bash (>= 5.2)
- coreutils (>= 9.3)
- perl (>= 5.38)
- samtools (>= 1.10)
- bedtools2 (>= 2.27.1)
- parallel (>= 20170322)
- bedops (>= 2.4.35)
- R (>= 4.0.2)
- R library dplyr (>= 1.0.7)
- R library bedr (>= 1.0.7)
- R library doParallel (>= 1.0.16)
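A quick way to confirm the command-line tools above are installed is to probe your `PATH`. This loop is a convenience sketch, not part of RCL (the tool names mirror the list above; R is checked via `Rscript`):

```shell
# Report which required preprocessing tools are available on PATH.
for tool in bash perl samtools bedtools parallel bedops Rscript; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "found: $tool"
  else
    echo "MISSING: $tool"
  fi
done
```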
For the deep learner step, a GPU is needed. Other packages needed are:
- Python (>=3.7.10)
- PyTorch Lightning (>=1.5.1)
- PyTorch (>=1.10.0)
- pandas (>=1.3.5)
- scipy (>=1.11.3)
- scikit-learn (>=1.0.1)
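Similarly, you can confirm the Python packages import inside the activated environment. This check is an assumption, not part of RCL; note that depending on the installed version, PyTorch Lightning imports as `pytorch_lightning` (older releases) or `lightning`:

```shell
# Report which required Python modules import successfully.
for mod in torch pytorch_lightning pandas scipy sklearn; do
  if python -c "import $mod" 2>/dev/null; then
    echo "ok: $mod"
  else
    echo "missing: $mod"
  fi
done
```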
Quickstart <a name = "quickstart" />
Here is a demo of the steps needed to get started with RCL on a Fedora 39 install:
## install dependencies
# task: install non-python dependencies in the root environment
# task: make sure gpu is recognized
conda create -n rcl # creating rcl conda environment
conda install -n rcl pytorch # PyTorch
conda install -n rcl lightning # PyTorch Lightning
conda install -n rcl scipy scikit-learn pandas
conda activate rcl
## the remaining commands follow tutorial described in this README
git clone https://github.com/Tuteja-Lab/UnsupervisedPeakCaller.git
cd UnsupervisedPeakCaller
# task: download data as RCLexamples.zip into current directory
unzip -j RCLexamples.zip -d example
cp -r example example.save
# next two commands force fresh run by overwriting existing output files
bash ./preprocessing.bash -d example -b "MCF7_chr10_rep1.bam MCF7_chr10_rep2.bam" -t 20 -n test
bash ./run_rcl.sh -d example -b "MCF7_chr10_rep1.bam MCF7_chr10_rep2.bam" -w
diff --brief example example.save
# the preprocessed RCL input files should be identical in both directories
# random initialization slightly alters the RCL fit (example/rcl.ckpt) and scores (example/rcl.bed)
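To focus the comparison on the files that are expected to change between runs, you can filter the `diff` output. A minimal sketch, using the file names from the comments above:

```shell
# List differing files; only the model checkpoint and peak scores
# (rcl.ckpt, rcl.bed) should change between runs due to random initialization.
diff --brief example example.save | grep -E 'rcl\.(ckpt|bed)'
```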
Installation <a name = "installation" />
After installing the prerequisites, all you have to do is clone the RCL repository and move into its root directory:
git clone https://github.com/Tuteja-Lab/UnsupervisedPeakCaller.git
cd UnsupervisedPeakCaller
Tutorial <a name = "tutorial" />
The RCL pipeline starts after you obtain reference-aligned read data. The pipeline consists of a data preprocessor and a peak caller, which are covered in detail in this document. To demonstrate the pipeline steps on a small dataset, we have prepared a small tutorial. The tutorial consists of three parts discussed in corresponding parts of this document. Quick links to all three parts are listed here:
- Tutorial Step 1: Get the data.
- Tutorial Step 2: Preprocess the data.
- Tutorial Step 3: Peak calling.
Input <a name = "input" />
The RCL preprocessor requires BAM files for each replicate. The RCL peak caller requires the output of the preprocessing step. You can read more about peak caller input.
Example: Tutorial Step 1 <a name = "data_example" />
To demonstrate RCL, we provide the portion of the MCF-7 dataset aligning to human chromosome 10.
The output for this example is provided with RCL, so if you want to skip data preprocessing for now, you can go directly to peak calling.
To compare the output of your run to the tutorial output we obtained, first save a copy of the example directory.
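Saving the copy is a single command, run from the repository root (the same step appears in the Quickstart):

```shell
# Keep a pristine copy of the example directory for a later diff.
cp -r example example.save
```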
Continuing with the preprocessing demonstration, download the necessary BAM files and indices from https://iastate.box.com/s/9uavg2zsy5w0i7v227ei7yaeea7yktr6.
If you download the zip file RCLexamples.zip from the cybox and place it in the root of the RCL git repository, the following commands (executed from the root of the RCL git repository) will place them appropriately:
unzip -j RCLexamples.zip -d example
You can skip to Tutorial Step 2.
Preprocessing <a name = "preprocessing" />
We have provided a bash preprocessing script to convert input BAM files (see input) into the required RCL input.
The script assumes your data have been aligned to the Ensembl assembly of the mouse or human genome.
If not, the script will still run (though it is important that you use command-line option -g), but no blacklist regions will be removed.
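As a hedged illustration (the data directory, BAM file names, genome label, and run name below are hypothetical), a run against a genome other than Ensembl human or mouse might look like:

```shell
# Custom genome label via -g: the script still runs, but removes no
# blacklist regions because none are bundled for this assembly.
bash ./preprocessing.bash -d mydata -b "rep1.bam rep2.bam" -g zebrafish -t 8 -n myrun
```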
Preprocessing Input <a name = "preprocessing_input" />
The preprocessing input is the same as the pipeline input.
Preprocessing Command-Line Options <a name = "preprocessing_options" />
For more information about the preprocessing script type bash ./preprocessing.bash -? from the RCL git repository root.
The most important command-line options are mentioned below:
- `-b` (DEFAULT: none, you must provide): The BAM files for the individual replicates. You must name at least two BAM files from replicate experiments. Space-separate the names and surround the list with double quotes. For example, the tutorial is run with this option set as `-b "MCF7_chr10_rep1.bam MCF7_chr10_rep2.bam"`.
- `-c` (DEFAULT: `median`): An integer coverage cutoff to identify candidate peaks. The default is to use the minimum median (zero-coverage sites excluded) observed across replicates on a per-chromosome basis. However, you can call more peaks by reducing this number. In the RCL publication, we demonstrate that decreasing this cutoff generally enhances RCL performance, but it cannot be less than 1.
- `-d` (DEFAULT: `example`): The data directory, where both input and output will be written.
- `-g` (DEFAULT: `hg`): Indicate the genome the reads in the input BAM files are aligned to. The default assumes the reads are aligned to the Ensembl assembly hg38. If you are using the Ensembl assembly of mouse, you should set option `-g mm`. If you are using another genome, you should name it as you like (`-g my_id`), but not `mm` or `hg`. It is very important that you do not leave the default value if your data are not aligned to hg38!
- `-t` (DEFAULT: `1`): Set the number of threads you would like to use. Most preprocessing steps have been parallelized, so take advantage of it with this command option.
- `-n` (DEFAULT: `out`): Data preprocessing is an expensive operation with many intermediate files. To keep track of or maintain multiple versions of those files, name them with this command option. All intermediate and final files will be prefixed with this identifier and placed in the output directory (option `-o`) <i>if you use the save option</i> (`-s`). You can reuse these saved files and copy them to the expected input files for RCL by running the preprocessing command again. When the files have already been generated, the preprocessing script will run very quickly. Beware that our logic for checking the integrity of intermediate files is imperfect. If your call to `preprocessing.bash` is killed, intermediate files may be in a corrupt state, which may or may not be detectable. To be sure, either delete all intermediate files (`rm example/test* example/chr*/test*`, where `test` is the name `-n test` you chose for the run) and rerun the preprocessing script, or run the preprocessing with the overwrite option (`-w`).
- `-o` (DEFAULT: same as input directory): The directory where intermediate and output files will be stored.
- `-r` (DEFAULT: `chr`): If your genome reference names include "chr" as a prefix, you should set this reference prefix to the empty string `""` (two double quotes with no space between them).
- `-s` (DEFAULT: no): Save the intermediate files generated by the preprocessing script. Also see options `-n` and `-w`.
- `-w` (DEFAULT: no): Overwrite any files from a previous run of the preprocessing script with the same name (option `-n`) and the save option `-s`.
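Putting several of these options together, a hypothetical save-then-overwrite workflow (the data directory, BAM names, output directory, and run name are assumptions) could look like:

```shell
# First run: save intermediates (-s) under run name "myrun" (-n) into out/ (-o).
bash ./preprocessing.bash -d mydata -b "rep1.bam rep2.bam" -t 8 -n myrun -o out -s
# Rerun from scratch, overwriting the previously saved files (-w).
bash ./preprocessing.bash -d mydata -b "rep1.bam rep2.bam" -t 8 -n myrun -o out -s -w
```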
Example: Tutorial Step 2 <a name = "preprocessing_example" />
After following the instructions in Tutorial Step 1 to get and place the data, you can run the preprocessing script.
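The invocation from the Quickstart applies here, run from the repository root:

```shell
# Preprocess the two MCF-7 chromosome 10 replicates with 20 threads,
# prefixing intermediate files with the run name "test".
bash ./preprocessing.bash -d example -b "MCF7_chr10_rep1.bam MCF7_chr10_rep2.bam" -t 20 -n test
```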
