SkillAgentSearch skills...

Edgecase

A framework for extracting telomeric reads from single-molecule sequencing experiments, describing their sequence variation and motifs, and for haplotype inference.

Install / Use

/learn @LankyCyril/Edgecase

README

edgeCase

edgeCase is a framework for extraction and interpretation of telomeric reads from long-read single-molecule whole genome sequencing datasets. Paper: https://genome.cshlp.org/content/31/7/1269

haplotypes_example

Installation

Obtaining code

The code can either be downloaded from the releases page (the SLIM tarball),
or cloned with git: git clone https://github.com/LankyCyril/edgecase

Environment setup

With Conda (preferred)

$ cd edgecase
$ conda env create --name edgecase --file environment.yaml
$ conda activate edgecase
$ ./edgecase

By manually installing dependencies

$ cd edgecase
$ pip install numpy scipy scikit-learn statsmodels edlib
$ pip install pandas matplotlib seaborn tqdm regex pysam
$ ./edgecase

Input data and formats

The extended reference genome

edgeCase works with SAM/BAM files aligned to a reference that is annotated with known subtelomeric regions and uses reads anchored to the outermost ends of subtelomeres (5' on the p arm, 3' on the q arm).

ref.fa.fai:  a FAI index; create with "samtools faidx ref.fa"
ref.fa.ecx:  an index containing annotations of subtelomere-telomere boundaries

ref.fa.ecx, a.k.a. the edgeCase indeX, describes anchors of interest in the reference genome; the format is based on the BED format. Usable "flag" values have to be among 4096 (hard mask), 8192 (fork), 16384 (telomeric tract). Two examples of ECX files can be found in the "assets" subdirectory.

Specifically, as described in the bioRxiv preprint, the human reference can be constructed from the hg38/GRCh38 reference genome and subtelomeric assemblies published by Stong et al., 2014. To generate this reference, which we call "extended", or hg38ext, run assets/generate-hg38ext.py --remote > hg38ext.fa.

Alignment files

We recommend using minimap2 to generate BAM files for edgeCase. Another option is winnowmap, but it has not been sufficiently tested yet.

NB: currently, it is imperative to supply a BAM file where secondary alignment entries have read sequences. For example, minimap2 creates BAMs in this format with the use of the -Y switch.
We plan to implement a workaround for this requirement in the future.

BAM files must also be indexed (i.e., have a .bai file created with samtools index).

Custom SAM flags

edgeCase extends the zoo of SAM flags with four of its own. The full table of flag names:
(also see https://broadinstitute.github.io/picard/explain-flags.html)

name | value | hex value | comment -------------------|-------|-----------|---------------------------------------------- paired | 1 | 0x0001 | SAM specification flag mapped_proper_pair | 2 | 0x0002 | SAM specification flag unmapped | 4 | 0x0004 | SAM specification flag mate_unmapped | 8 | 0x0008 | SAM specification flag rev | 16 | 0x0010 | SAM specification flag mate_rev | 32 | 0x0020 | SAM specification flag 1stmate | 64 | 0x0040 | SAM specification flag 2ndmate | 128 | 0x0080 | SAM specification flag secondary | 256 | 0x0100 | SAM specification flag qcfail | 512 | 0x0200 | SAM specification flag pcrdup | 1024 | 0x0400 | SAM specification flag supp | 2048 | 0x0800 | SAM specification flag mask_anchor | 4096 | 0x1000 | edgeCase-specific flag; added during pipeline fork | 8192 | 0x2000 | edgeCase-specific flag; added during pipeline tract_anchor | 16384 | 0x4000 | edgeCase-specific flag; added during pipeline is_q | 32768 | 0x8000 | edgeCase-specific flag; added during pipeline

NB: these flags are unused in the SAM specification and should not clash with anything. samtools view can correctly subset using these flags.

Note: All edgeCase routines that allow flag filtering recognize both the numeric flag format (such as 3844) and the "human-readable" format such as "rev". Combinations are also understood, for example, "-F 3844 -F is_q".

Note: In the future, custom SAM flags may be superseded with tags.
A backwards compatibility layer will be provided (i.e., arguments like "-f fork" or "-F 16384" will still work but interpret and produce appropriate tags).

The edgeCase pipeline

Usage: ./edgecase [-h | --help]
       ./edgecase <command> [<args>...]

Commmands (<command>):
    tailpuller               select overhanging long reads
    tailchopper              get overhanging heads/tails of long reads
    repeatfinder             discover enriched repeats in candidate sequences
    kmerscanner              perform scan of known kmers/motifs
    densityplot              visualize densities of candidate motifs

Development area:
    entropy                  calculate motif entropy among long reads
    levenshtein              calculate pairwise edit distance among long reads

All commands output their results to stdout; you must pipe them into other commands or into the destination file. This applies even to outputs in PDF and PKL formats.

NB: Depending on the aligner used upstream, MAPQ of secondary reads may have been set to zero regardless of real mapping quality; use this filtering option with caution. This warning applies to all edgeCase subroutines that accept the -q filtering flag.

tailpuller

Outputs a subset SAM file that contains only the reads that overhang anchors defined in the ECX. If the read overhangs the mask anchor, the 4096 SAM flag is added; for forks, 8192 is added; for telomeric tracts, 16384.
For reads on the q arm (i.e., on the 3' end), the 32768 flag is added (see above for the full list and the explanation of flags).

Usage: ./edgecase tailpuller -x filename [-t targetspec]...
                            [-M integer] [--min-map-overlap integer]
                            [-m integer] [--min-telomere-overlap integer]
                            [--output-ambiguous-reads string]
                            [-f flagspec]... [-F flagspec]... [-q integer] <bam>

Required options:
    -x, --index [filename]                   location of the reference .ecx index

Options:
    -t, --target [targetspec]                target reads overlapping these features (ECX flags) [default: tract_anchor]
    -M, --max-read-length [integer]          maximum read length to consider when selecting lookup regions
    --min-map-overlap [integer]              minimum overlap of reference to consider read as mapped [default: 1]
    -m, --min-subtelomere-overlap [integer]  minimum overlap of subtelomere to consider read as candidate [default: 1]
    --min-telomere-overlap [integer]         minimum overlap of telomere to consider read as candidate [default: 1]
    --output-ambiguous-reads [string]        which ambiguously mapping reads to retain (none, all, longest-overlap) [default: none]

Input filtering options:
    -f, --flags [flagspec]                   process only entries with all these sam flags present [default: 0]
    -F, --flag-filter [flagspec]             process only entries with none of these sam flags present [default: 0]
    -q, --min-quality [integer]              process only entries with this MAPQ or higher [default: 0]

Suggestions:

  • It is recommended to include secondary and supplementary reads (i.e., leave the -F flag as default [0]), because:
    • edgeCase determines unambiguously mapped reads on its own; aligners assign the 'supplementary' flag to multi-mapping reads arbitrarily, and removing such supplementary reads upstream may lead to loss of information in telomeric regions;
    • edgeCase will discard chimeric reads in terminal regions if information about supplementary alignments is present.
  • Supplying --max-read-length drastically improves wall time if reads are significantly shorter than chromosomes; for PacBio HiFi (CCS) it is suggested to use the value of 30000. If the value is not specified, edgeCase will assume infinity, and will have to go over the entire content of the BAM file.
  • Suggested value of --min-map-overlap for PacBio HiFi: 500.
  • Suggested value of --min-(sub)telomere-overlap for PacBio HiFi: 3000.
  • Pipe the output through samtools view -bh - to compress on the fly.

tailchopper

Truncates reads in the tailpuller file either to soft/hard-clipped ends (when --target is "cigar"), or to sequences extending past given anchor (when --target is "tract_anchor", "fork", or "mask_anchor").

Outputs a SAM file with overhanging tails of candidate reads.

Usage: ./edgecase tailchopper -x filename [-t targetspec]
                            [-f flagspec]... [-F flagspec]... [-q integer] <bam>

Required options:
    -x, --index [filename]        location of the reference .ecx index

Options:
    -t, --target [targetspec]     an ECX flag (cut relative to reference) or 'cigar' [default: tract_anchor]

Input filtering options:
    -f, --flags [flagspec]        process only entries with all these sam flags present [default: 0]
    -F, --flag-filter [flagspec]  process only entries with none of these sam flags present [default: 0]
    -q, --min-quality [integer]   process only entries with this MAPQ or higher [default: 0]

NB: tailchopper outputs a SAM file with unmapped reads (sets the 0x0004 bit in the flag), but retains the original mapping position; do not use this value for downstream analyses unless you know exactly what you are after.

Suggestion: pipe the output through samtools view -bh - to compress on the fly.

repeatfinder

Expects the SAM

Related Skills

View on GitHub
GitHub Stars16
CategoryDevelopment
Updated6mo ago
Forks2

Languages

Python

Security Score

92/100

Audited on Oct 6, 2025

No findings