Edgecase
A framework for extracting telomeric reads from single-molecule sequencing experiments, describing their sequence variation and motifs, and for haplotype inference.
Install / Use
/learn @LankyCyril/EdgecaseREADME
edgeCase
edgeCase is a framework for extraction and interpretation of telomeric reads from long-read single-molecule whole genome sequencing datasets. Paper: https://genome.cshlp.org/content/31/7/1269

Installation
Obtaining code
The code can either be downloaded from the
releases page
(the SLIM tarball),
or cloned with git: git clone https://github.com/LankyCyril/edgecase
Environment setup
With Conda (preferred)
$ cd edgecase
$ conda env create --name edgecase --file environment.yaml
$ conda activate edgecase
$ ./edgecase
By manually installing dependencies
$ cd edgecase
$ pip install numpy scipy scikit-learn statsmodels edlib
$ pip install pandas matplotlib seaborn tqdm regex pysam
$ ./edgecase
Input data and formats
The extended reference genome
edgeCase works with SAM/BAM files aligned to a reference that is annotated with known subtelomeric regions and uses reads anchored to the outermost ends of subtelomeres (5' on the p arm, 3' on the q arm).
ref.fa.fai: a FAI index; create with "samtools faidx ref.fa"
ref.fa.ecx: an index containing annotations of subtelomere-telomere boundaries
ref.fa.ecx, a.k.a. the edgeCase indeX, describes anchors of interest in the reference genome; the format is based on the BED format. Usable "flag" values have to be among 4096 (hard mask), 8192 (fork), 16384 (telomeric tract). Two examples of ECX files can be found in the "assets" subdirectory.
Specifically, as described in the bioRxiv preprint, the human reference can be
constructed from the hg38/GRCh38 reference genome and subtelomeric assemblies
published by Stong et al., 2014.
To generate this reference, which we call "extended", or hg38ext, run
assets/generate-hg38ext.py --remote > hg38ext.fa.
Alignment files
We recommend using minimap2 to generate BAM files for edgeCase. Another option is winnowmap, but it has not been sufficiently tested yet.
NB: currently, it is imperative to supply a BAM file where secondary
alignment entries have read sequences. For example, minimap2 creates BAMs in
this format with the use of the -Y switch.
We plan to implement a workaround for this requirement in the future.
BAM files must also be indexed (i.e., have a .bai file created with
samtools index).
Custom SAM flags
edgeCase extends the zoo of SAM flags with four of its own. The full table of
flag names:
(also see https://broadinstitute.github.io/picard/explain-flags.html)
name | value | hex value | comment -------------------|-------|-----------|---------------------------------------------- paired | 1 | 0x0001 | SAM specification flag mapped_proper_pair | 2 | 0x0002 | SAM specification flag unmapped | 4 | 0x0004 | SAM specification flag mate_unmapped | 8 | 0x0008 | SAM specification flag rev | 16 | 0x0010 | SAM specification flag mate_rev | 32 | 0x0020 | SAM specification flag 1stmate | 64 | 0x0040 | SAM specification flag 2ndmate | 128 | 0x0080 | SAM specification flag secondary | 256 | 0x0100 | SAM specification flag qcfail | 512 | 0x0200 | SAM specification flag pcrdup | 1024 | 0x0400 | SAM specification flag supp | 2048 | 0x0800 | SAM specification flag mask_anchor | 4096 | 0x1000 | edgeCase-specific flag; added during pipeline fork | 8192 | 0x2000 | edgeCase-specific flag; added during pipeline tract_anchor | 16384 | 0x4000 | edgeCase-specific flag; added during pipeline is_q | 32768 | 0x8000 | edgeCase-specific flag; added during pipeline
NB: these flags are unused in the SAM specification and should not clash with
anything. samtools view can correctly subset using these flags.
Note: All edgeCase routines that allow flag filtering recognize both the numeric flag format (such as 3844) and the "human-readable" format such as "rev". Combinations are also understood, for example, "-F 3844 -F is_q".
Note: In the future, custom SAM flags may be superseded with tags.
A backwards compatibility layer will be provided (i.e., arguments like "-f fork"
or "-F 16384" will still work but interpret and produce appropriate tags).
The edgeCase pipeline
Usage: ./edgecase [-h | --help]
./edgecase <command> [<args>...]
Commmands (<command>):
tailpuller select overhanging long reads
tailchopper get overhanging heads/tails of long reads
repeatfinder discover enriched repeats in candidate sequences
kmerscanner perform scan of known kmers/motifs
densityplot visualize densities of candidate motifs
Development area:
entropy calculate motif entropy among long reads
levenshtein calculate pairwise edit distance among long reads
All commands output their results to stdout; you must pipe them into other commands or into the destination file. This applies even to outputs in PDF and PKL formats.
NB: Depending on the aligner used upstream, MAPQ of secondary reads may have
been set to zero regardless of real mapping quality; use this filtering option
with caution. This warning applies to all edgeCase subroutines that accept
the -q filtering flag.
tailpuller
Outputs a subset SAM file that contains only the reads that overhang anchors
defined in the ECX. If the read overhangs the mask anchor, the 4096 SAM flag is
added; for forks, 8192 is added; for telomeric tracts, 16384.
For reads on the q arm (i.e., on the 3' end), the 32768 flag is added (see
above for the full list and the explanation of flags).
Usage: ./edgecase tailpuller -x filename [-t targetspec]...
[-M integer] [--min-map-overlap integer]
[-m integer] [--min-telomere-overlap integer]
[--output-ambiguous-reads string]
[-f flagspec]... [-F flagspec]... [-q integer] <bam>
Required options:
-x, --index [filename] location of the reference .ecx index
Options:
-t, --target [targetspec] target reads overlapping these features (ECX flags) [default: tract_anchor]
-M, --max-read-length [integer] maximum read length to consider when selecting lookup regions
--min-map-overlap [integer] minimum overlap of reference to consider read as mapped [default: 1]
-m, --min-subtelomere-overlap [integer] minimum overlap of subtelomere to consider read as candidate [default: 1]
--min-telomere-overlap [integer] minimum overlap of telomere to consider read as candidate [default: 1]
--output-ambiguous-reads [string] which ambiguously mapping reads to retain (none, all, longest-overlap) [default: none]
Input filtering options:
-f, --flags [flagspec] process only entries with all these sam flags present [default: 0]
-F, --flag-filter [flagspec] process only entries with none of these sam flags present [default: 0]
-q, --min-quality [integer] process only entries with this MAPQ or higher [default: 0]
Suggestions:
- It is recommended to include secondary and supplementary reads (i.e., leave
the -F flag as default [0]), because:
- edgeCase determines unambiguously mapped reads on its own; aligners assign the 'supplementary' flag to multi-mapping reads arbitrarily, and removing such supplementary reads upstream may lead to loss of information in telomeric regions;
- edgeCase will discard chimeric reads in terminal regions if information about supplementary alignments is present.
- Supplying
--max-read-lengthdrastically improves wall time if reads are significantly shorter than chromosomes; for PacBio HiFi (CCS) it is suggested to use the value of 30000. If the value is not specified, edgeCase will assume infinity, and will have to go over the entire content of the BAM file. - Suggested value of --min-map-overlap for PacBio HiFi: 500.
- Suggested value of --min-(sub)telomere-overlap for PacBio HiFi: 3000.
- Pipe the output through
samtools view -bh -to compress on the fly.
tailchopper
Truncates reads in the tailpuller file either to soft/hard-clipped ends (when --target is "cigar"), or to sequences extending past given anchor (when --target is "tract_anchor", "fork", or "mask_anchor").
Outputs a SAM file with overhanging tails of candidate reads.
Usage: ./edgecase tailchopper -x filename [-t targetspec]
[-f flagspec]... [-F flagspec]... [-q integer] <bam>
Required options:
-x, --index [filename] location of the reference .ecx index
Options:
-t, --target [targetspec] an ECX flag (cut relative to reference) or 'cigar' [default: tract_anchor]
Input filtering options:
-f, --flags [flagspec] process only entries with all these sam flags present [default: 0]
-F, --flag-filter [flagspec] process only entries with none of these sam flags present [default: 0]
-q, --min-quality [integer] process only entries with this MAPQ or higher [default: 0]
NB: tailchopper outputs a SAM file with unmapped reads (sets the 0x0004 bit in the flag), but retains the original mapping position; do not use this value for downstream analyses unless you know exactly what you are after.
Suggestion: pipe the output through samtools view -bh - to compress on the
fly.
repeatfinder
Expects the SAM
Related Skills
node-connect
348.2kDiagnose OpenClaw node connection and pairing failures for Android, iOS, and macOS companion apps
frontend-design
108.9kCreate distinctive, production-grade frontend interfaces with high design quality. Use this skill when the user asks to build web components, pages, or applications. Generates creative, polished code that avoids generic AI aesthetics.
openai-whisper-api
348.2kTranscribe audio via OpenAI Audio Transcriptions API (Whisper).
qqbot-media
348.2kQQBot 富媒体收发能力。使用 <qqmedia> 标签,系统根据文件扩展名自动识别类型(图片/语音/视频/文件)。
