Analysis of epigenetic signals captured by fragmentation patterns of cell-free DNA

The following sections provide details for the analyses performed in:

Snyder MW, Kircher M, Hill AJ, Daza RM, Shendure J. Cell-free DNA Comprises an In Vivo Nucleosome Footprint that Informs Its Tissues-Of-Origin. Cell. 2016 Jan 14;164(1-2):57-68. doi: 10.1016/j.cell.2015.11.050. PubMed PMID: 26771485

All scripts and binaries are provided as is, without any warrenty and for use at your own risk. This is not the release of a software package. We are only providing this information and code in addition to a description of methods for making it easier to reproduce our analyses. We are not providing any support for these scripts.

Versions and dependencies

Our scripts largely depend on Python 2, the pysam, bx and numpy libraries. Here a list of versions that we used:

python/2.7.3
numpy/1.7.0
pysam/0.7.5
bx-python/0.6.0

We were also using R 3.1.0, UCSC binaries for working with bigWig and bigBed files and tools like tabix and samtools from HTSlib.

Please consider using a software management tool like conda to install the respective packages. You can try more recent versions of the packages, but please keep in mind that Python 3 cannot be used to interpret Python 2.7 code.

Samtools version used

Please note that a samtools binary is included with these scripts. Among other things, this samtools binary allows filtering reads based on insert size/read length. This is an early version of the samtools branch released on https://github.com/mpieva/samtools-patched. For samtool calls that are not filtering for read length/insert size, other (more recent) versions of samtools might be used. Please note though that parameters might be named differently and that the ASCII encoding of read filters is also special to the samtools version that we used.

Primary data processing of sequencing data

Barcoded paired-end (PE) sequencing data was split allowing up to one substitution in the barcode sequence. Fragments shorter than or equal to the read length were consensus-called and adapter-trimmed. Remaining consensus single-end reads (SR) and the individual PE reads were aligned to the human reference genome (GRCh37, 1000 Genomes phase 2 technical reference, ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/) using the ALN algorithm in BWA v0.7.10 (Li and Durbin, 2010). PE reads were further processed with BWA SAMPE to resolve ambiguous placement of read pairs or to rescue missing alignments by a more sensitive alignment step around the location of one placed read end. Aligned SR and PE data were stored in BAM format using the samtools API (Li et al., 2009). BAM files for each sample were merged across lanes and sequencing runs. BAM files for each sample were deposited in the NCBI Gene Expression Omnibus (GEO) with accession GSE71378 (http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE71378). Note that GEO is distributing BAM files through SRA (http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?study=SRP061633 and http://www.ncbi.nlm.nih.gov/Traces/study/?acc=SRP061633). Update: SRA has recently converted the provided BAM files to SRA format. You can use the SRA tools to convert back. We are providing a copy of the original BAM files here.

Simulated reads and nucleotide frequencies

Aligned sequencing data was simulated (SR if shorter than 45 bp, PE 45 bp otherwise) for all major chromosomes of the human reference (GRC37h). Kmer nucleotide frequencies (k=2 is presented in the manuscript) were determined from real data on both fragment ends and both strand orientations (referenceKMers.py), and for the reference genome on both strands (BAM2FragKMers.py). The insert size distribution of the real data was extracted for the 1-500 bp range (BAM_RG_Length.py --noRG). Reads were simulated procedurally (implemented in simulate_reads.py): at each step (i.e., at least once at each genomic coordinate, depending on desired coverage), (1) the strand is randomly chosen, (2) the ratio of the kmer frequency in the real data to that in the reference sequence is used to randomly decide whether the initiating kmer is considered, (3) an length is sampled from the insert size distribution, and (4) the frequency ratio of the terminal kmer is used to randomly decide whether the generated alignment is reported. The simulated coverage was matched to that of the original data after PCR duplicate removal (by adjusting how often sequences are sampled at each position).

# Extract kmers (here 2mers) from reference genome 
./referenceKMers.py -k 2 -r '' -o grch37_regChroms_2mers.tsv

# Extact the length distribution from the sample BAM, keep only the 1-500bp range 
./BAM_RG_Length.py --noRG -p tmp_outfile BAMFILE.bam
head -n 501 tmp_outfile_ > ${SAMPLE}_lenDist.tsv

# Extract kmers (here 2mers) of left and right read ends from the SAMPLE file
./BAM2FragKMers.py --kmerLE=2 --kmerRE=2 -v -r 'ALL' BAMFILE.bam --outfileLE=${SAMPLE}_left_2mer --outfileRE=${SAMPLE}_right_2mer

# Run simulation for each "regular" chromosome (1..22, X, Y) -- should be run in parallel
for i in $(head -n 24 grch37_1000g_phase2/whole_genome.fa.fai | awk '{ print $1":1-"$2 }'); do \
  ./simulate_reads.py -v -d ${SAMPLE}_lenDist.tsv --fwdKMerGenome=grch37_regChroms_2mers.tsv --revKMerGenome=grch37_regChroms_2mers.tsv --fwdPKMers=${SAMPLE}_left_2mer_f.tsv --fwdMKMers=${SAMPLE}_left_2mer_r.tsv --revPKMers=${SAMPLE}_right_2mer_f.tsv --revMKMers=${SAMPLE}_right_2mer_r.tsv -s 20  -r \'"$i"\' -o ${SAMPLE}_2mer_sim_chr$( echo $i | cut -f 1 -d':').bam; \
done

# Combine chromosome files, index and regenerate some stats
./samtools merge -u ${SAMPLE}_chr*.bam | ./samtools sort - ${SAMPLE}_allChrom
./samtools index ${SAMPLE}_allChrom.bam
./samtools flagstat ${SAMPLE}_allChrom.bam > ${SAMPLE}_allChrom.bam_stats
./samtools view -F _2 ${SAMPLE}_allChrom.bam | cut -f 3 | uniq -c > ${SAMPLE}_allChrom.byChromCounts.txt
./BAM2FragmentationPatterns.py -r ALL -o ${SAMPLE}_allChrom.fragpatterns.txt ${SAMPLE}_allChrom.bam
./BAM_RG_Length.py --noRG -p ${SAMPLE}_allChrom.length ${SAMPLE}_allChrom.bam

Analysis of nucleotide composition of 167 bp fragments

Fragments with inferred lengths of exactly 167 bp were filtered within samples to remove duplicates. Dinucleotide frequencies were calculated in a strand-aware manner, using a sliding 2 bp window and reference alleles at each position, beginning 50 bp upstream of one fragment endpoint and ending 50 bp downstream of the other endpoint. Observed dinucleotide frequencies at each position were compared to expected dinucleotide frequencies determined from a set of simulated reads reflecting the same cleavage biases calculated in a library-specific manner.

See the folder nucleotide_composition for more details.

Coverage, fragment endpoints, and windowed protection scores

Fragment endpoint coordinates were extracted from BAM files with the SAMtools API. Both outer alignment coordinates of PE data were extracted for properly paired reads. Both end coordinates of SR alignments were extracted when PE data was collapsed to SR data by adapter trimming. A fragment’s coverage is defined as all positions between the two (inferred) fragment ends, inclusive of endpoints. We define the Windowed Protection Score (WPS) of a window of size k as the number of molecules spanning the window minus those with an endpoint within the window. We assign the determined WPS to the center of the window. For 35-80 bp fragments (short fraction, S-WPS), k=16; for 120-180 bp fragments (long fraction, L-WPS), k=120. We store coverage, read starts and WPS in gzip-compressed WIG files. We used wigToBigWig from the UCSC tools (http://hgdownload.cse.ucsc.edu/admin/exe/) for converting these to BigWig (bw) for visualization in genome viewers.

Running the extraction of coverage, read starts and WPS for the long fraction:

./samtools view -u -m 120 -M 180 BAMFILE.bam $EXTREGION | ./FilterUniqueBAM.py -p | ./extractReadStartsFromBAM2Wig.py -p -r $REGION -w 120 -c COVERAGE.wig.gz -n WPS.wig.gz -s STARTS.wig.gz

Running the extraction of coverage, read starts and WPS for the short fraction:

./samtools view -u -m 35 -M 80 BAMFILE.bam $EXTREGION | ./FilterUniqueBAM.py -p | ./extractReadStartsFromBAM2Wig.py -p -r $REGION -w 16 -c COVERAGE.wig.gz -n WPS.wig.gz -s STARTS.wig.gz

Please note the variables EXTREGION and REGION above. When piping reads from a region, we might miss the forward read of a read pair and the provided scripts usually only extract information from the first read of a pair. Thus, EXTREGION should include additional bases around the region (i.e. 200bp or another value that guarantees that all read insert sizes are included; here 180bp and 80bp would be sufficient; e.g. if REGION is 1:1000-2000, EXTREGION should be 1:800-2200).

Nucleosome peak calling

For nuclesome peak calling, the L-WPS is locally adjusted to a running median of zero in 1 kb windows and smoothed using a Savitzky-Golay filter (Savitzky and Golay, 1964) (window size 21, 2nd order polynomial). The L-WPS track is then segmented into abov

CfDNA

Install / Use

README