Easym6A

Process m6A/MeRIP-seq data in a single or batch job mode

Generate Convert Improve

Install / Use

/learn @shunliubio/Easym6A

About this skill

Quality Score

0/100

README

easym6A: Process m6A/MeRIP-seq data in a single or batch job mode

easym6A creates a bash script to process m6A/MeRIP-seq data from adapter trimming to peak calling. It can also serve as a pipeline for RNA-seq data.

Dependencies:

Cutadapt, tested with 1.15
Samtools, tested with 1.7
HISAT2, tested with 2.1.0
Picard, tested with 2.17.10
bedtools, tested with 2.27.1
bedGraphToBigWig
StringTie, tested with 1.3.4d
prepDE.py
gffcompare, tested with 0.10.4
MACS2, tested with 2.1.1.20160309
HOMER, tested with 4.9
R, tested with 3.3.3. Install three libraries exomePeak, MeTPeak and MeTDiff in R.

Installation:

Add paths of executable programs of above dependencies, easym6A.pl and 3peakSuit.R to PATH. Assume that their paths are listed as follows:

Here is an example of appending them in ~/.bash_profile or ~/.profile.

PATH=~/.local/bin:~/software/samtools-1.7/bin:~/software/hisat2-2.1.0:~/software/picard-2.17.10:~/software/bedtools-2.27.1/bin:~/home/shunliu/software/ucsc:~/software/stringtie-1.3.4d.Linux_x86_64:~/software/gffcompare-0.10.4.Linux_x86_64:~/software/homer-4.9/bin:~/software/R-3.3-el7-x86_64/bin:~/software/easym6A:$PATH

export PATH

Note that Line 1 and 8 of 3peakSuit.R are the paths of Rscript and installed packages, respetively. They are also required to be consistent with R related environment variables.

3peakSuit.R Line 1

Modify ~/software/R-3.3-el7-x86_64/bin/Rscript to meet with the path of Rscript.

#!/usr/bin/env ~/software/R-3.3-el7-x86_64/bin/Rscript

3peakSuit.R Line 8

Modify ~/software/R-3.3-el7-x86_64/lib64/R/library to meet with the path of the R library where exomePeak, MeTPeak and MeTDiff are located.

.libPaths("~/software/R-3.3-el7-x86_64/lib64/R/library")

Usage:

1. Prepare the index for the reference genome and repetitive elements.

easym6A uses HISAT2 to map reads to the genome. One genome sequence file in fasta format and one genome annotation file in gtf format are required to build the index for the genome. More specification can be referred in the HISAT2 manual.

# Given that hg38.fa and gencode.v27.annotation.gtf are in the directory geonome and annotation, respectively.

# Generate index files into genome/hg38_tran_index
mkdir -p genome/hg38_tran_index
# Extract exons from the gtf file
extract_exons.py annotation/gencode.v27.annotation.gtf > genome/hg38_tran_index/gencode.v27.annotation.exon
# Extract splicing sites from the gtf file
extract_splice_sites.py annotation/gencode.v27.annotation.gtf > genome/hg38_tran_index/gencode.v27.annotation.ss
# Build the index for the genome
hisat2-build --ss genome/hg38_tran_index/gencode.v27.annotation.ss --exon genome/hg38_tran_index/gencode.v27.annotation.exon genome/hg38.fa genome/hg38_tran_index/hg38_tran

As easym6A can filter reads mapped to repetitive elements, one repetitive element sequence file in fasta format is required to build the index for repetitive elements, unless the filter step is ignored.

# Given that one would like to filter reads mapped to rRNAs. hg38_rRNA.fa is in the directory genome.

# Generate index files into genome/hg38_rRNA_index
mkdir -p genome/hg38_rRNA_index
# Build the index for rRNAs
hisat2-build genome/hg38_rRNA.fa genome/hg38_rRNA_index/hg38_rRNA

2. Create a table file including m6A/MeRIP-seq sample information

The table file is consisted of 19 fields, which records basic information of samples such as their file names, file paths, library types, etc. One row for one sample. It can be edited in EXCEL and then saved as plain text format (.txt). Here is an example of four HeLa samples (two input-IP pairs, one for control, the another for treatment):

| Species | Cell Line | Dataset | Sample | Group | File Name | Fastq File Dir | Fastq File | Bam File Path | 5’ Adapter | 3’ Adapter | 5’ Barcode | 3’ Barcode | Q33 | Strandness | Fragment Length | Read Length | Seq Layout | ID | |---------|-----------|---------|----------------|-----------------|----------------------|--------------------------|--------------------------------|----------------------------------------------|------------|------------|------------|------------|-----|------------|-----------------|-------------|------------|----| | human | HeLa | - | siControl_rep1 | control_input | input_siControl_rep1 | /your/raw/data/file/path | input_siControl.fastq.gz | /your/bam/file/path/input_siControl_rep1.bam | - | - | - | - | N | R | 150 | 50 | SINGLE | 1 | | human | HeLa | - | siControl_rep1 | control_ip | ip_siControl_rep1 | /your/raw/data/file/path | ip_siControl.fastq.gz | /your/bam/file/path/ip_siControl_rep1.bam | - | - | - | - | N | R | 150 | 50 | SINGLE | 2 | | human | HeLa | - | siMETTL3_rep1 | treatment_input | input_siMETTL3_rep1 | /your/raw/data/file/path | input_siMETTL3.fastq.gz | /your/bam/file/path/input_siMETTL3_rep1.bam | - | - | - | - | N | R | 150 | 50 | SINGLE | 3 | | human | HeLa | - | siMETTL3_rep1 | treatment_ip | ip_siMETTL3_rep1 | /your/raw/data/file/path | ip_siMETTL3.fastq.gz | /your/bam/file/path/ip_siMETTL3_rep1.bam | - | - | - | - | N | R | 150 | 50 | SINGLE | 4 |

Species: Optional. Not used in the pipeline.
Cell Line: Optional. Not used in the pipeline.
Dataset: Optional. Not used in the pipeline.
Sample: Optional. Not used in the pipeline.
Group: Optional. Not used in the pipeline.
File Name: Required. The name of the folder generated by easym6A.
Fastq File Dir: Required. The directory path of the fastq file.
Fastq File: Required. The name of the fastq file. For paired-end data, read 1 and read 2 file are seperated by comma.
Bam File Path: Optional or required. It is optional if de novo data processing is chosen or required if one prefers to perform peak calling using exsiting bam files (related to the option -k/--onlypeak. See below).
5' Adapter: Optional. The 5' adapter of the sequencing data. For paired-end data, 5' adapters of read 1 and read 2 are seperated by comma.
3' Adapter: Optional. The 3' adapter of the sequencing data. For paired-end data, 3' adapters of read 1 and read 2 are seperated by comma.
5' Barcode: Optional. The 5' barcode of the sequencing data. Note that the current version of easym6A does not support the option.
3' Barcode: Optional. The 3' barcode of the sequencing data. Note that the current version of easym6A does not support the option.
Q33: Required. Accepted value is either Y or N. Y for "Phred+33" quality encoding and N for "Phred+64" quality encoding.
Strandness: Required. Accepted value is F (forward strand), R (reverse strand) or U (unstranded) for single-end data, FR (forward strand), RF (reverse strand) or U (unstranded) for paired-end data. These values are consistent with those in the HISAT2 option --rna-strandness.
Fragment Length: Required. The insert size of the library. Used in peak calling tools.
Read Length: Required. The sequencing read length of the data. Used in both steps of gene quantification and peak calling.
Seq Layout: Required. Accepted value is either SINGLE for single-end data or PAIRED for paired-end data.
ID: Required. Sample ID for numbering samples. An important field as easym6A uses it to recognize which samples to be run.

Note that leave - in optional fields if the information is not available. Other characters or strings are not accepted.

3. Create a configuration file including paths of required files and output.

The configuration file contains only two columns and 12 rows. The first column is the key word field that is not allowed to be modified. Edit the second column to tell easym6A the locations of input files that are expected to be loaded and running results that are expected to be output.

| Field | Path | |----------------------|------------------------------------------------------| | bash_log_dir | /log/output/directory | | cutadapt_out_dir | /adapter/trimmed/data/output/directory | | hisat2_index_repBase | /repetitive/elements/hisat2/index/directory/basename | | hisat2_index_genome | /genome/sequences/hisat2/index/directory/basename | | hisat2_out_dir | /alignment/results/output/directory | | stringtie_out_dir | /gene/quantification/results/output/directory | | chrom_size | /chromosome/sizes/filename | | transcriptome_size | 100000000 | | gtf_file | /gene/annotation/gtf/filename | | bed12_file | /gene/annotation/bed12/filename | | peak_out_dir | /peak/calling/results/output/directory | | genome_fasta_file | /genome/sequences/fasta/filename |

bash_log_dir: Set a directory for a bash script and a log file that records running status of the bash script.
cutadapt_out_dir: Set a directory for adapter-trimmed data processed by Cutadapt.
**hisat

Related Skills

node-connect

342.0k

Diagnose OpenClaw node connection and pairing failures for Android, iOS, and macOS companion apps

frontend-design

84.7k

Create distinctive, production-grade frontend interfaces with high design quality. Use this skill when the user asks to build web components, pages, or applications. Generates creative, polished code that avoids generic AI aesthetics.

openai-whisper-api

342.0k

Transcribe audio via OpenAI Audio Transcriptions API (Whisper).

commit-push-pr

84.7k

Commit, push, and open a PR