SomaticoEP1
Exercício Somático EP1
Install / Use
/learn @renatopuga/SomaticoEP1README
Pipeline somático: Exercício 1
Tempo de Processamento
| Infraestrutura | CPU | Memória (GB) | Storage (GB) | Etapa | Tempo | | -------------- | ---- | ------------ | ------------ | ----------------- | --------------- | | gitpod | 16 | 64 | ?DD 30 | FASTQ -> SAM | 55-61min (chr9) | | | | | | SAM -> BAM | 13-18min | | | | | | BAM -> SORT_BAM | ~6-12min | | | | | | SORT_BAM -> RMDUP | ~12 min | | servidor local | 16 | 32 | SSD 30 | todas as etapas | ~2h (hg19) |
Shotgun e Amplicon
<img width="1438" alt="Screen Shot 2022-10-28 at 22 55 11" src="https://user-images.githubusercontent.com/8321336/198762626-8a788ca0-d5f9-495b-a96b-7ae807ba071b.png">Workflow
-
O arquivo no NCBI da amostra WP312 (tumor)
-
Projeto SRA: https://www.ncbi.nlm.nih.gov/Traces/study/?acc=SRP190048&o=bases_l%3Aa
-
ID: SRR8856724
-
Precisa instalar o sratoolkit (dentro do gitpod)
-
# instalar utilizando o brew install (gitpod) brew install sratoolkit # parallel fastq-dump # https://github.com/rvalieris/parallel-fastq-dump pip install parallel-fastq-dump # caso o vdb-config não funcione wget -c https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/3.0.0/sratoolkit.3.0.0-ubuntu64.tar.gz tar -zxvf sratoolkit.3.0.0-ubuntu64.tar.gz export PATH=$PATH://workspace/somaticoEP1/sratoolkit.3.0.0-ubuntu64/bin/ echo "Aexyo" | sratoolkit.3.0.0-ubuntu64/bin/vdb-config -
# fastq-dump: comando que faz o download do arquivo utilizando o SRR ID da amostra # --sra-id: SRR # --threads 4: paraleliza em 4 partes o download # --gzip: fastq compactados # --split-files: ele vai separar os arquivos fastq em 1 e 2 (paired) time parallel-fastq-dump --sra-id SRR8856724 --threads 4 --outdir ./ --split-files --gzip -
Google Colab sratoolkit (modo alternativo)
-
-
# download do binário ubuntu 64bits
%%bash
wget -c https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/3.0.0/sratoolkit.3.0.0-ubuntu64.tar.gz
# -z unzip
# -x extract
# -v verbose
# -f force?
tar -zxvf sratoolkit.3.0.0-ubuntu64.tar.gz
# validate
./sratoolkit.3.0.0-ubuntu64/bin/vdb-config
# adicionando o diretorio dos programas no path
export PATH=$PATH:/content/sratoolkit.3.0.0-ubuntu64/bin/
# downlaod parallel
parallel-fastq-dump --sra-id SRR8856724 --threads 6 --outdir ./ --split-files --gzip --tmpdir /content/
-
AS Referências do Genoma hg38 (FASTA, VCFs)
-
Os arquivos de Referência: Panel of Normal (PoN), Gnomad AF e Exac common:
- https://console.cloud.google.com/storage/browser/gatk-best-practices/somatic-hg38?project=broad-dsde-outreach
wget -c https://storage.googleapis.com/gatk-best-practices/somatic-hg38/af-only-gnomad.hg38.vcf.gzwget -c https://storage.googleapis.com/gatk-best-practices/somatic-hg38/af-only-gnomad.hg38.vcf.gz.tbiwget -c https://storage.googleapis.com/gatk-best-practices/somatic-hg38/1000g_pon.hg38.vcf.gzwget -c https://storage.googleapis.com/gatk-best-practices/somatic-hg38/1000g_pon.hg38.vcf.gz.tbiwget -c https://storage.googleapis.com/gatk-best-practices/somatic-hg38/small_exac_common_3.hg38.vcf.gzwget -c https://storage.googleapis.com/gatk-best-practices/somatic-hg38/small_exac_common_3.hg38.vcf.gz.tbi -
Arquivo no formato FASTA do genoma humano hg38
-
Diretório Download UCSC hg38: https://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/
-
chr9.fa.gz: https://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr9.fa.gz
wget -c https://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr9.fa.gz
-
-
-
-
BWA para mapeamento dos arquivos FASTQ (gitpod)
- Instalando o BWA no gitpod
brew install bwa- Google Colab
Abri um CODE
!sudo apt-get install bwa -
BWA index do arquivo chr9.fa.gz
# descompacta o arquivo gz gunzip chr9.fa.gz # bwa index para indexar o arquivo .fa (~5min) bwa index chr9.fa -
Samtools faidx
- Install Gitpod
brew install samtools- Install Google Colab
!sudo apt-get install samtools- Samtools faidx chr9.fa
samtools faidx chr9.fa -
BWA-mem para fazer o alinhamento (FASTQ -> BAM)
NOME=WP312; Biblioteca=Nextera; Plataforma=illumina;
bwa mem -t 16 -M -R "@RG\tID:$NOME\tSM:$NOME\tLB:$Biblioteca\tPL:$Plataforma" \
chr9.fa \
SRR8856724_1.fastq.gz \
SRR8856724_2.fastq.gz > WP312.sam
samtools: fixmate, sort e index (~min)
# -@ numero de cores utilizados
time samtools fixmate -@10 WP312.sam WP312.bam
# ordenando o arquivo fixmate
time samtools sort -O bam -@6 -m2G -o WP312_sorted.bam WP312.bam
# indexando o arquivo BAM ordenado (sort)
time samtools index WP312_sorted.bam
# abordagem de target sequencing utilizamos o rmdup para remover duplicata de PCR
time samtools rmdup WP312_sorted.bam WP312_sorted_rmdup_F4.bam
# indexando o arquivo BAM rmdup
time samtools index WP312_sorted_rmdup_F4.bam
Alternativa: combinar com pipes: bwa + samtools view e sort
bwa mem -t 10 -M -R "@RG\tID:$NOME\tSM:$NOME\tLB:$Biblioteca\tPL:$Plataforma" chr9.fa SRR8856724_1.fastq.gz SRR8856724_2.fastq.gz | samtools view -F4 -Sbu -@2 - | samtools sort -m4G -@2 -o WP312_sorted.bam
NOTA: se utilizar a opção alternativa, não esquecer de rodar o samtools para as etapas: rmdup e index (do rmdup).
- Cobertura - make BED files
Instalação do bedtools
Gitpod
brew install bedtools
Google Colab
!sudo apt-get install bedtools
Gerando BED do arquivo BAM
bedtools bamtobed -i WP312_sorted_rmdup_F4.bam > WP312_sorted_rmdup.bed
bedtools merge -i WP312_sorted_rmdup.bed > WP312_sorted_rmdup_merged.bed
bedtools sort -i WP312_sorted_rmdup_merged.bed > WP312_sorted_rmdup_merged_sorted.bed
Cobertura Média
Pegando arquivo view -F4 do BAM WP312.
git clone https://github.com/circulosmeos/gdown.pl.git
./gdown.pl/gdown.pl https://drive.google.com/file/d/1pTMpZ2eIboPHpiLf22gFIQbXU2Ow26_E/view?usp=drive_link WP312_sorted_rmdup_F4.bam
./gdown.pl/gdown.pl https://drive.google.com/file/d/10utrBVW-cyoFPt5g95z1gQYQYTfXM4S7/view?usp=drive_link WP312_sorted_rmdup_F4.bam.bai
bedtools coverage -a WP312_sorted_rmdup_merged_sorted.bed \
-b WP312_sorted_rmdup_F4.bam -mean \
> WP312_coverageBed.bed
Filtro por total de reads >=20
cat WP312_coverageBed.bed | \
awk -F "\t" '{if($4>=20){print}}' \
> WP312_coverageBed20x.bed
- GATK4 instalação (MuTect2 com PoN)
Download
wget -c https://github.com/broadinstitute/gatk/releases/download/4.2.2.0/gatk-4.2.2.0.zip
Descompactar
unzip gatk-4.2.2.0.zip
Gerar arquivo .dict
./gatk-4.2.2.0/gatk CreateSequenceDictionary -R chr9.fa -O chr9.dict
Gerar interval_list do chr9
./gatk-4.2.2.0/gatk ScatterIntervalsByNs -R chr9.fa -O chr9.interval_list -OT ACGT
Converter Bed para Interval_list
./gatk-4.2.2.0/gatk BedToIntervalList -I WP312_coverageBed20x.bed \
-O WP312_coverageBed20x.interval_list -SD chr9.dict
GATK4 - CalculateContamination
./gatk-4.2.2.0/gatk GetPileupSummaries \
-I WP312_sorted_rmdup_F4.bam \
-V af-only-gnomad.hg38.vcf.gz \
-L WP312_coverageBed20x.interval_list \
-O WP312.table
./gatk-4.2.2.0/gatk CalculateContamination \
-I WP312.table \
-O WP312.contamination.table
GATK4 - MuTect2 Call
./gatk-4.2.2.0/gatk Mutect2 \
-R chr9.fa \
-I WP312_sorted_rmdup_F4.bam \
--germline-resource af-only-gnomad.hg38.vcf.gz \
--panel-of-normals 1000g_pon.hg38.vcf.gz \
-L WP312_coverageBed20x.interval_list \
-O WP312.somatic.pon.vcf.gz
GATK4 - MuTect2 FilterMutectCalls
./gatk-4.2.2.0/gatk FilterMutectCalls \
-R chr9.fa \
-V WP312.somatic.pon.vcf.gz \
--contamination-table WP312.contamination.table \
-O WP312.filtered.pon.vcf.gz
somaticoEP1 - LiftOverVCF
Download dos arquivos VCFs da versão hg19 da análise antiga do Projeto LMA Brasil:
https://drive.google.com/drive/folders/1m2qmd0ca2Nwb7qcK58ER0zC8-1_9uAiE?usp=sharing
WP312.filtered.vcf.gz
WP312.filtered.vcf.gz.tbi
Download liftOver
- MacOS: https://hgdownload.cse.ucsc.edu/admin/exe/macOSX.x86_64/liftOver
wget -c https://hgdownload.cse.ucsc.edu/admin/exe/macOSX.x86_64/liftOver
- Linux x86_64: https://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/liftOver
wget -c https://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/liftOver
Alterar a permissão de execução do arquivo liftOver
chmod +x liftOver
Para testar execute:
./liftOver
liftOver - Move annotations from one assembly to another
usage:
liftOver oldFile map.chain newFile unMapped
oldFile and newFile are in bed format by default, but can be in GFF and
maybe eventually others with the appropriate flags below.
The map.chain file has the old genome as the target and the new genome
as the query.
***********************************************************************
WARNING: liftOver was
