PED : Polymorphic Edge Detection

Polymorphic Edge Detection (PED) is an analysis flow for detecting DNA polymorphisms from short reads produced by next-generation sequencers (NGS). I developed two methods that detect polymorphisms by detecting the polymorphic edge. One is based on bidirectional alignment and the other on comparison of k-mers. Examples of PED results and other useful information are available on the Web pages (English) (Japanese) (Paper) (Blog).

Polymorphic Edge

A DNA polymorphism is any difference in DNA sequence between individuals. These differences include single nucleotide polymorphisms (SNPs), insertions, deletions, inversions, translocations and copy number variations. In a non-polymorphic region, the sequences of two individuals are identical. At the position of a SNP, or at the beginning of any other polymorphism, the nucleotide must differ between the individuals. This first differing position is the polymorphic edge.

Bidirectional alignment method

                                                                Chr11 80443004
                                                                |
TTTTTAATTGAAAAGGCATTAAGCTGGGTCTATGCAGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTAGATAGGTAGAAAAAAAAAACCACTATCAGCAACA Reference sequence matching from 5'-end
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||  | |     | ||||||||   |  |       |  
TTTTTAATTGAAAAGGCATTAAGCTGGGTCTATGCAGTGTGTGTGTGTGTGTGTGTGTGTGTGTAGATAGGTAGAAAAAAAAAACCACTATCAGCAACAGT Short read sequence
|||       ||             |  | |     |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
TTTAATTGAAAAGGCATTAAGCTGGGTCTATGCAGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTAGATAGGTAGAAAAAAAAAACCACTATCAGCAACAGT Reference sequence matching from 3'-end
                                   |
                                   Chr11 80442977

The short read sequence is aligned with the reference sequence from both the 5'- and 3'-ends. The positions indicated above and below the alignment are the first mismatched bases, i.e., the polymorphic edges. The bidirectional alignment clearly indicates a two-base (GT) deletion in the short read. The bidirectional alignment method can detect not only deletions but also SNPs, insertions, inversions and translocations.
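
As a minimal sketch of the idea, not the code in ped.pl, the Python below counts how many bases match from each end before the first mismatch. The function names and the toy sequences are illustrative assumptions, and the read and the reference segment are assumed to be already anchored at both ends.

```python
def matched_from_5prime(read, ref):
    """Count bases that match when read and ref are aligned at their 5'-ends."""
    n = min(len(read), len(ref))
    for i in range(n):
        if read[i] != ref[i]:
            return i  # index of the first mismatch = polymorphic edge
    return n


def matched_from_3prime(read, ref):
    """Count bases that match when read and ref are aligned at their 3'-ends."""
    n = min(len(read), len(ref))
    for i in range(1, n + 1):
        if read[-i] != ref[-i]:
            return i - 1  # bases matched before the first mismatch
    return n


# Toy data (hypothetical): the read lacks one GT unit of a GT repeat.
ref = "AAGTGTGTCC"
read = "AAGTGTCC"

left = matched_from_5prime(read, ref)
right = matched_from_3prime(read, ref)
deleted = len(ref) - len(read)

print(left, right, deleted)
```

In this toy case the matches from the two ends overlap inside the GT repeat, which mirrors the alignment above, where the 3'-side edge lies upstream of the 5'-side edge.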

K-mer method

Individual_A AAATGGTACATTTATATTAT
Individual_B AAATGGTACATTTATATTAC

All short reads from Individual_A and Individual_B are sliced into k-mers (e.g. k = 20) at every position. For example, Individual_A has the k-mer AAATGGTACATTTATATTAT but does not have AAATGGTACATTTATATTAC. Conversely, Individual_B has AAATGGTACATTTATATTAC but does not have AAATGGTACATTTATATTAT. The last base of the k-mer is T in Individual_A and C in Individual_B. Such a last base must be a SNP or the edge of an insertion, deletion, inversion, translocation or copy number variation. The k-mer method detects polymorphic edges from differences in the last base of k-mers that are otherwise identical. This allows polymorphisms to be detected by direct comparison of NGS data.
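
The comparison can be sketched in a few lines of Python. This is only an illustration under simple assumptions (all k-mers held in memory, no read-count thresholds, hypothetical function names); it is not how ped.pl itself is implemented.

```python
from collections import defaultdict


def last_bases_by_prefix(reads, k):
    """Map the first k-1 bases of every k-mer to the set of last bases seen."""
    table = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            table[kmer[:-1]].add(kmer[-1])
    return table


def polymorphic_edges(reads_a, reads_b, k):
    """Return prefixes shared by both individuals whose last bases differ."""
    a = last_bases_by_prefix(reads_a, k)
    b = last_bases_by_prefix(reads_b, k)
    edges = {}
    for prefix in a.keys() & b.keys():
        if a[prefix] != b[prefix]:
            edges[prefix] = (a[prefix], b[prefix])
    return edges


# The two k-mers from the example above, used here as single reads.
reads_a = ["AAATGGTACATTTATATTAT"]
reads_b = ["AAATGGTACATTTATATTAC"]

for prefix, (bases_a, bases_b) in polymorphic_edges(reads_a, reads_b, 20).items():
    print(prefix, sorted(bases_a), sorted(bases_b))
```

Grouping k-mers by their first k-1 bases makes the last base the only thing left to compare, which is the difference the method looks for.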

For analysis of SARS-CoV-2 (COVID-19) data

perl download.pl accession=SRR11542244
perl ped.pl target=SRR11542244,ref=SARS-CoV-2

docker run -v `pwd`:/work -w /ped akiomiyao/ped perl download.pl accession=SRR11542244,wd=/work
docker run -v `pwd`:/work -w /ped akiomiyao/ped perl ped.pl target=SRR11542244,ref=SARS-CoV-2,wd=/work

The run time of ped.pl is only two minutes per accession on a standard desktop computer running Linux (Ubuntu).
If you want to analyze your own private sequences,

cd ped
mkdir your_sample_name
mkdir your_sample_name/read
cp somewhere/read_data.fastq your_sample_name/read
perl ped.pl target=your_sample_name,ref=SARS-CoV-2
or 
docker run -v `pwd`:/work -w /ped akiomiyao/ped perl ped.pl target=your_sample_name,ref=SARS-CoV-2,wd=/work

The target name for ped.pl is the directory name.
Detailed Link for COVID-19 analysis

Simplified instruction

ped.pl is a multithreaded (multiprocess) script, suited to multi-core CPUs such as those with 4 or 8 cores.
ped.pl also runs on a dual- or single-core machine, but slowly.
ped.pl runs on Linux (or FreeBSD) machines and Macs with at least 4 GB RAM and a 1 TB hard disk (or SSD).

The following demonstrates detection of spontaneous SNPs and SVs in Caenorhabditis elegans propagated for 250 generations.

cd ped
perl download.pl accession=ERR3063486
perl download.pl accession=ERR3063487
perl ped.pl target=ERR3063487,control=ERR3063486,ref=WBcel235

Installation of fastq-dump and ped scripts is described below.
The docker container for Linux includes fastq-dump and ped scripts.

docker run -v `pwd`:/work -w /ped akiomiyao/ped perl download.pl accession=ERR3063486,wd=/work
docker run -v `pwd`:/work -w /ped akiomiyao/ped perl download.pl accession=ERR3063487,wd=/work
docker run -v `pwd`:/work -w /ped akiomiyao/ped perl ped.pl target=ERR3063487,control=ERR3063486,ref=WBcel235,wd=/work

ERR3063487 is the sequence of the nematode after 250 generations from ERR3063486.
BioProject https://www.ncbi.nlm.nih.gov/bioproject/PRJEB30822
Downloading fastq files may take several hours, because the connection of fastq-dump to NCBI-SRA is slow.
Sometimes download.pl reports a network connection timeout. In that case, the network is reconnected and the download resumes.
Fastq files are saved in ERR3063486/read and ERR3063487/read.
The SNPs and SVs in ERR3063487 relative to ERR3063486, i.e. spontaneous mutations during 250 generations, are saved in the ERR3063487 (target) directory.
If control is omitted, polymorphisms against the reference genome are saved in the target directory.
If a script is run without arguments, a description of how to use it is shown.
ERR3063487.vcf is the result in vcf format. The vcf file can be opened with Integrative Genomics Viewer (IGV).
Options,
thread=8 : specify the maximum number of threads (processes).
Default is the number of logical cores.
tmpdir=/mnt/ssd : set the temporary directory to /mnt/ssd. Default is the target directory.
clipping=100 : If the length of short reads is not fixed, ped.pl determines a suitable clipping length.
If you want to force the clipping length, add the clipping option.
The distribution of read counts by sequence length can be obtained with check_length.pl
perl check_length.pl target=ERR3063487
A clipping length giving 90-95% coverage is enough.
The current version of ped.pl has an auto clipping function.
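
To illustrate how a clipping length might be chosen from a length distribution, here is a small Python sketch. It assumes that "coverage" means the fraction of reads at least as long as the clipping length, and the input dictionary is a hypothetical stand-in for the output of check_length.pl, whose actual format is not described here. The auto clipping in ped.pl may work differently.

```python
def clipping_length(length_counts, coverage=0.90):
    """Longest length L such that at least `coverage` of reads have length >= L.

    length_counts maps read length -> number of reads (hypothetical input).
    """
    total = sum(length_counts.values())
    kept = 0
    for length in sorted(length_counts, reverse=True):
        kept += length_counts[length]
        if kept / total >= coverage:
            return length
    raise ValueError("length_counts is empty")


# Made-up distribution for illustration only.
counts = {150: 700, 140: 150, 120: 100, 80: 50}
print(clipping_length(counts))
```

The value found this way could then be passed as clipping=... if the automatic choice needs to be overridden.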
Result files,

File name               Description
ERR3063487.aln          Bidirectional alignment
ERR3063487.bi.primer    Primer data for PCR
ERR3063487.bi.snp       SNP data (original format)
ERR3063487.bi.snp.count SNP data (Showing snp counts from aln data)
ERR3063487.index        Index file for alignment search
ERR3063487.log          Process log
ERR3063487.report       Log of ped.pl
ERR3063487.sv           Structural variation data
ERR3063487.sv.count     Structural variation data (Showing snp counts from aln data)
ERR3063487.sv.primer    Primer data for PCR
ERR3063487.vcf          SNP and SV data (vcf format, for IGV)
ERR3063487.full.vcf     SNP and SV data (vcf format, full output)
ERR3063487.count.vcf    SNP and SV data (vcf format, full output with unverified data)

For analyses of a metagenome or mixed genome (e.g. SARS-CoV-2 data from a patient), using the count data is recommended, because SNPs or SVs detected at close positions but on different genome strands may be filtered out during the verification process.
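
Because the vcf results are plain tab-separated text, they can be summarized with a short script. The Python sketch below relies only on the standard VCF column order (CHROM, POS, ID, REF, ALT, ...); the example records are made up, and how ped.pl encodes SVs in its ALT and INFO columns is not assumed here, so everything that is not a single-base substitution is simply counted as "other".

```python
def classify_vcf_records(vcf_text):
    """Count single-base substitutions versus all other records in VCF text."""
    counts = {"snp": 0, "other": 0}
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.split("\t")
        ref, alt = fields[3], fields[4]  # REF and ALT columns
        if len(ref) == 1 and len(alt) == 1 and alt != ".":
            counts["snp"] += 1
        else:
            counts["other"] += 1
    return counts


# Hypothetical records, not actual PED output.
example = "\n".join([
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t.\t.\t.",
    "chr1\t200\t.\tACGT\tA\t.\t.\t.",
])
print(classify_vcf_records(example))
```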

Installation

If you do not want to use the docker container, you need to download the programs.
The programs run on Unix platforms (FreeBSD or Linux) and Mac.
Download the zip file of PED from https://github.com/akiomiyao/ped and extract it,
or

git clone https://github.com/akiomiyao/ped.git

If your machine does not have git, install it from a package.

sudo apt install git (Ubuntu)
sudo yum install git (CentOS)
sudo pkg install git (FreeBSD)

If you obtained the scripts with git clone, updating to the newest version is easy with git pull.

git pull

To download sequence data, fastq-dump from NCBI is required.
The toolkit can be downloaded from
https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software
Details of setting up fastq-dump are described in
https://akiomiyao.github.io/ped/sratoolkit/index.html
To download reference data, curl is required.
If your machine does not have curl, install it from a package.

sudo apt install curl (Ubuntu)
sudo yum install curl (CentOS)
sudo pkg install curl (FreeBSD)

Setup of Docker (For Docker users, Optional)

If docker is installed, ped can be run with the docker command without installing ped beforehand.
https://docs.docker.com/install/linux/docker-ce/ubuntu/

curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh

sudo apt install docker
sudo apt install docker.io

To get or update the container,

sudo docker pull akiomiyao/ped

To check running containers,

sudo docker stats

To kill running container,

sudo docker kill Container_ID

If you want to run the docker container without sudo or su,

sudo usermod -a -G docker your_username

After logging in again, docker commands can be executed with your account.

Supported reference genomes

  Name             Description
  97103            Watermelon (Citrullus lanatus subsp. vulgaris) cv. 97213v2
  Asagao1.2        Asagao (Ipomoea nil) Japanese morning glory
  B73v4            Corn (Zea mays B73) RefGen v4
  Bomo             Silkworm (B
