# AMBER — Ancient Metagenomic BinnER
AMBER bins metagenomic contigs from ancient DNA (aDNA) samples. Standard binners assume coverage and tetranucleotide frequency are clean signals; in aDNA they are not. AMBER models post-mortem damage and fragment length explicitly, so ancient and modern strains of the same genome end up in the same bin rather than scattered across several.
## The problem: ancient DNA breaks metagenomic binning
Standard binners (MetaBAT2 [1], SemiBin2 [2], COMEBin [3]) rely on two signals: tetranucleotide frequency (genomic composition) and coverage depth (co-abundance across samples). Both are distorted in ancient DNA:
- Damage-induced composition shift. Post-mortem deamination converts cytosines to uracil (read as T) at fragment termini [4]. C→T substitutions accumulate at the 5′ end and G→A at the 3′ end, changing the apparent tetranucleotide composition of every contig. Damaged and undamaged copies of the same genome are pushed apart in composition space.
- Fragment length bias in coverage. Ancient reads are short (median ~40–80 bp) while modern reads are longer (~150–300 bp). Short reads map more ambiguously and cover terminal regions of contigs differently, producing systematically different coverage profiles for ancient and modern genomes even at equal abundance [5].
- Mixed ancient/modern populations. Paleogenomic assemblies often contain reads from both ancient (damaged) and modern (undamaged) DNA, arising from environmental contamination, recent colonising organisms, or in situ DNA turnover. A binner unaware of this mixture splits ancient and modern reads of the same taxon into separate bins, or merges them with the wrong neighbours [6].
AMBER addresses all three: damage-aware embeddings prevent composition distortion from separating contigs of the same genome, and `amber deconvolve` separates ancient from modern reads via Bayesian EM when needed.
## How AMBER works
<p align="center"> <img src="amber_architecture.png" width="900" alt="AMBER pipeline architecture"> </p>

### 1. Feature extraction
For each contig (minimum 1,001 bp by default), AMBER extracts a 157-dimensional feature vector combining encoder output, aDNA-specific damage features, and multi-scale chaos game representation (CGR) features:
| Feature block | Dims | Description |
|---------------|------|-------------|
| Encoder output | 128 | SCG-guided damage-aware InfoNCE contrastive encoder |
| C→T damage profile | 5 | C→T substitution rates at 5′ positions 1–5 |
| G→A damage profile | 5 | G→A substitution rates at 3′ positions 1–5 |
| Decay parameters | 2 | λ₅, λ₃: exponential damage decay constants fitted per contig |
| Fragment length | 2 | Mean and standard deviation of aligned read lengths |
| Damage coverage | 2 | Log-normalised read depth from ancient-classified (p > 0.6) and modern-classified (p < 0.4) reads |
| Mismatch spectrum | 4 | T→C at 5′, C→T at 3′, other mismatches at 5′, other mismatches at 3′ |
| CGR features | 9 | 6 cross-scale slopes (ΔH, ΔO, ΔL at 16→32 and 32→64 grid resolution) + absolute H₃₂, O₃₂, L₃₂ |
aDNA damage features (20 dimensions). The damage profile follows an exponential decay model, with terminal substitution rate at position p from the read end modelled as [7]:
$$\delta_p = d \cdot e^{-\lambda p}$$
where d is the damage amplitude and λ is the per-position decay constant. AMBER fits λ₅ and λ₃ independently for the 5′ and 3′ ends using the observed C→T and G→A rates at positions 1–5. Fragment length mean and standard deviation are computed from the aligned insert size distribution [5].
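The decay fit above reduces to a log-linear regression: taking logarithms of δ_p = d·e^(−λp) gives ln δ_p = ln d − λp, a straight line in p. A minimal sketch of that fit (the helper name `fit_decay` is ours, not AMBER's API):

```python
import numpy as np

def fit_decay(rates, positions=(1, 2, 3, 4, 5)):
    """Fit delta_p = d * exp(-lam * p) to observed terminal substitution
    rates at the first few read positions, via log-linear least squares.
    Returns (d, lam): damage amplitude and per-position decay constant."""
    p = np.asarray(positions, dtype=float)
    r = np.asarray(rates, dtype=float)
    # ln(delta_p) = ln(d) - lam * p, so a degree-1 polyfit on log-rates
    # recovers -lam as the slope and ln(d) as the intercept.
    slope, intercept = np.polyfit(p, np.log(r), 1)
    return float(np.exp(intercept)), float(-slope)

# Synthetic 5' C->T profile with d = 0.3, lam = 0.4
rates = 0.3 * np.exp(-0.4 * np.arange(1, 6))
d, lam = fit_decay(rates)
```

Fitting the 5′ C→T and 3′ G→A profiles separately with this scheme yields the λ₅ and λ₃ entries of the feature vector.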
CGR features (9 dimensions). The chaos game representation [8] maps a nucleotide sequence onto the unit square by iterating toward one of four corner attractors (A, C, G, T). AMBER computes three metrics (Shannon entropy H, occupancy O as fraction of non-empty cells, and lacunarity L as coefficient of variation of cell densities) at grid resolutions 16×16, 32×32, and 64×64. The nine features are the three absolute values at 32×32 plus the six cross-scale slopes (Δmetric between adjacent resolutions), capturing sequence complexity across scales [9].
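The three CGR metrics can be sketched directly from the chaos-game definition above (function names and the N-skipping rule are our assumptions, not AMBER's implementation):

```python
import numpy as np

# Corner attractors of the unit square, one per base.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}

def cgr_points(seq):
    """Chaos game: from the centre, step halfway toward the corner of
    each successive base; return the visited points."""
    x, y = 0.5, 0.5
    pts = []
    for base in seq:
        if base not in CORNERS:
            continue  # skip N and other ambiguity codes (our assumption)
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        pts.append((x, y))
    return np.array(pts)

def cgr_metrics(pts, res):
    """Shannon entropy H, occupancy O (fraction of non-empty cells), and
    lacunarity L (coefficient of variation of cell densities) on a
    res x res grid over the unit square."""
    grid, _, _ = np.histogram2d(pts[:, 0], pts[:, 1],
                                bins=res, range=[[0, 1], [0, 1]])
    p = grid.ravel() / grid.sum()
    nz = p[p > 0]
    H = float(-(nz * np.log2(nz)).sum())
    O = float((grid > 0).mean())
    L = float(grid.std() / grid.mean())
    return H, O, L
```

Evaluating `cgr_metrics` at 16, 32, and 64 and differencing adjacent resolutions gives the six cross-scale slopes; the 32×32 values are kept as the three absolute features.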
The encoder input is a 138-dimensional vector: 136 tetranucleotide frequencies (normalised, reverse-complement collapsed) + 2 coverage features (sqrt-transformed coverage variance and normalised mean depth). The 20 aDNA and 9 CGR dimensions bypass the encoder and are concatenated directly to the 128-dimensional encoder output, forming the 157-dimensional clustering space.
### 2. SCG-guided damage-aware contrastive learning
AMBER trains a contig encoder using InfoNCE [10] (contrastive predictive coding), extended with two aDNA-specific modifications: SCG-guided hard negative mining and damage-aware negative down-weighting.
Base encoder. The architecture follows COMEBin [3]: an MLP (138→2048→2048→2048→128, n_layer=3) with BatchNorm and LeakyReLU activations, and L2-normalised output embeddings. Six augmented views are generated per contig at training time (three coverage subsampling levels × two feature-noise intensities), matching the COMEBin augmentation protocol.
Standard InfoNCE loss [10] for a batch of B contigs with V views each:
$$\mathcal{L}_{\text{InfoNCE}} = -\frac{1}{BV} \sum_{i=1}^{BV} \log \frac{\exp(\text{sim}(z_i, z_{i^+})/\tau)}{\sum_{k \neq i} \exp(\text{sim}(z_i, z_k)/\tau)}$$
where z_i is the L2-normalised embedding of view i, i⁺ indexes the sibling views of the same contig, and τ is the temperature (default 0.1).
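A minimal NumPy rendering of this loss (the training code itself is not shown here; cosine similarity via dot products of L2-normalised embeddings, all non-sibling views as negatives):

```python
import numpy as np

def info_nce(z, contig_ids, tau=0.1):
    """Unweighted InfoNCE over a batch of view embeddings.
    z: (N, d) array of embeddings; contig_ids: length-N labels, where
    views of the same contig share a label and act as positives."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # cosine similarity
    sim = z @ z.T / tau
    np.fill_diagonal(sim, -np.inf)        # a view never pairs with itself
    log_denom = np.logaddexp.reduce(sim, axis=1)  # log sum over all k != i
    ids = np.asarray(contig_ids)
    loss, n_pos = 0.0, 0
    for i in range(len(z)):
        siblings = np.flatnonzero((ids == ids[i]) & (np.arange(len(z)) != i))
        for j in siblings:                # -log( exp(sim_pos) / denom )
            loss += log_denom[i] - sim[i, j]
            n_pos += 1
    return loss / n_pos
```

When sibling views are close and other contigs are far, the loss is near zero; misaligned positives drive it up, which is the gradient signal the encoder trains on.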
Modification 1: SCG hard negatives. Single-copy marker genes (SCGs) are typically present exactly once per genome. Two contigs both containing the same SCG are therefore likely from different genomes, making them high-confidence (though not guaranteed) negative pairs. AMBER amplifies these pairs in the InfoNCE denominator by a factor scg_boost (default 2.0), providing stronger repulsion signal for contigs that compositional embeddings might otherwise conflate [3]:
$$w_{ij}^{\text{SCG}} = \begin{cases} \alpha_{\text{boost}} & \text{if } M(i) \cap M(j) \neq \emptyset \\ 1.0 & \text{otherwise} \end{cases}$$
where M(i) is the set of CheckM marker genes detected on contig i [11] and α_boost = 2.0.
Modification 2: Damage-aware negative down-weighting. Ancient and modern strains of the same taxon carry identical genomic sequence but different damage states. Without correction, the encoder learns to separate them based on C→T and G→A patterns rather than genomic sequence, pushing them into different bins. AMBER down-weights negatives that appear damage-incompatible, preventing damage state from becoming a discriminative feature:
$$w_{ij}^{\text{damage}} = 1 - \lambda_{\text{att}} \cdot c_i \cdot c_j \cdot (1 - f_{\text{compat}}(i, j))$$
where $c_i = n_{\text{eff},i} / (n_{\text{eff},i} + n_0)$ is a read-depth confidence weight (low coverage → low confidence → the weight stays near 1.0 regardless of damage), $f_{\text{compat}}(i, j)$ is a symmetric damage compatibility score combining terminal C→T rates and $p_{\text{ancient}}$ agreement, and $\lambda_{\text{att}}$ controls the attenuation strength.
The combined weight applied to each negative pair is:

$$w_{ij} = \begin{cases} w_{ij}^{\text{SCG}} & \text{if } M(i) \cap M(j) \neq \emptyset \\ w_{ij}^{\text{damage}} & \text{otherwise} \end{cases}$$

so SCG hard negatives are always amplified regardless of damage similarity, while non-SCG pairs are modulated by damage compatibility alone.
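A sketch of that per-pair weighting, following the stated behaviour (SCG pairs amplified outright, everything else damage-modulated); the function name and the `lambda_att` default are our assumptions:

```python
def negative_weight(markers_i, markers_j, c_i, c_j, f_compat,
                    scg_boost=2.0, lambda_att=0.5):
    """InfoNCE negative-pair weight.
    markers_i/j: sets of SCG markers on each contig.
    c_i/j: read-depth confidence weights in [0, 1].
    f_compat: symmetric damage compatibility score in [0, 1]."""
    if markers_i & markers_j:
        # Shared single-copy marker: high-confidence negative, amplify.
        return scg_boost
    # Otherwise attenuate by damage incompatibility, scaled by how
    # confident we are in both contigs' damage estimates.
    return 1.0 - lambda_att * c_i * c_j * (1.0 - f_compat)
```

Note the two failure-safe limits: at zero coverage confidence the weight stays 1.0 (standard InfoNCE), and at full damage compatibility no attenuation is applied.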
Consensus kNN. Three independent encoder restarts (different random seeds) are trained on the same data. Each produces an HNSW approximate nearest-neighbour graph [12] (k = 100, cosine distance, 157-dim feature space). Edge weights from the three graphs are averaged into a single consensus graph before Leiden clustering, reducing sensitivity to training stochasticity.
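The consensus step amounts to averaging edge weights over the restart graphs. A minimal sketch, representing each graph as a `{(u, v): weight}` dict with `u < v`; treating an edge absent from a restart as weight 0 is our assumption (it means only consistently recovered neighbours stay strong):

```python
from collections import defaultdict

def consensus_graph(graphs):
    """Average edge weights across per-restart kNN graphs.
    graphs: list of {(u, v): weight} dicts, one per encoder restart."""
    acc = defaultdict(float)
    for g in graphs:
        for edge, w in g.items():
            acc[edge] += w
    n = len(graphs)  # edges missing from a restart contribute 0
    return {edge: total / n for edge, total in acc.items()}
```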
Marker gene database. SCG detection uses the 206 universal bacterial and archaeal single-copy marker HMM profiles from CheckM [11], bundled as auxiliary/checkm_markers_only.hmm. HMM search is performed with HMMER3 hmmsearch.
### 3. Quality-guided Leiden clustering
AMBER clusters the consensus kNN graph using the Leiden algorithm [13] with a three-phase quality refinement driven by SCG-based completeness and contamination estimates.
Phase 1: SCG-guided Leiden. Before each Leiden run, edges between contigs sharing any SCG marker are penalised:
$$w'_{ij} = w_{ij} \cdot e^{-\gamma \cdot n_{\text{shared}}}$$
with γ = 3.0 (default) and n_shared = |M(i) ∩ M(j)|. This pre-conditions the graph so that contigs from different genomes with similar embeddings are pushed apart before community detection runs.
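The penalty is a single pass over the graph's edges. A sketch (helper name ours), reusing the `{(u, v): weight}` edge-dict convention:

```python
import math

def penalise_scg_edges(edges, markers, gamma=3.0):
    """Down-weight edges between contigs sharing SCG markers:
    w' = w * exp(-gamma * |M(i) ∩ M(j)|).
    edges: {(i, j): weight}; markers: {contig: set of marker ids}."""
    out = {}
    for (i, j), w in edges.items():
        n_shared = len(markers.get(i, set()) & markers.get(j, set()))
        out[(i, j)] = w * math.exp(-gamma * n_shared)
    return out
```

With the default γ = 3.0, a single shared marker already cuts an edge to ~5% of its weight, so such pairs rarely end up in one Leiden community.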
The resolution parameter is swept over [0.5, 5.0] with 25 random Leiden seeds per resolution value. The configuration retained is the one that maximises a tiered quality score: strict-HQ bins (≥90% completeness, <5% contamination [14]) outrank pre-HQ bins (≥72% completeness, <5% contamination), which outrank MQ bins, which outrank raw completeness.
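The tiered score can be expressed as a lexicographically compared tuple, so one extra strict-HQ bin outranks any number of lower-tier gains. A sketch; the MQ thresholds (≥50% completeness, <10% contamination, per MIMAG [14]) and the cumulative counting (a strict-HQ bin also counts toward the lower tiers) are our assumptions:

```python
def tier_score(bins):
    """Tiered quality score for one (resolution, seed) configuration.
    bins: list of (completeness_pct, contamination_pct) per bin.
    Tuples compare lexicographically: strict-HQ count first, then
    pre-HQ, then MQ, then total raw completeness as a tie-breaker."""
    strict_hq = sum(1 for c, x in bins if c >= 90 and x < 5)
    pre_hq = sum(1 for c, x in bins if c >= 72 and x < 5)
    mq = sum(1 for c, x in bins if c >= 50 and x < 10)
    total_completeness = sum(c for c, _ in bins)
    return (strict_hq, pre_hq, mq, total_completeness)
```

The sweep then keeps `max(configurations, key=tier_score)`.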
Phase 2: Contamination splitting. After Phase 1, any bin with excess SCG duplication (dup_excess > 0, meaning more copies of any marker than expected) is re-clustered at 3× the Phase 1 resolution on its contig subgraph. The split is accepted if total dup_excess decreases across the resulting sub-bins.
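The `dup_excess` quantity and the acceptance rule are straightforward to state in code (helper names ours):

```python
from collections import Counter

def dup_excess(contig_markers):
    """Total excess SCG copies in a bin: each single-copy marker is
    expected once, so every additional copy counts toward the excess.
    contig_markers: list of per-contig marker-id lists."""
    counts = Counter(m for contig in contig_markers for m in contig)
    return sum(n - 1 for n in counts.values() if n > 1)

def accept_split(parent, sub_bins):
    """Phase-2 rule: keep the re-clustering only if total duplication
    excess across the sub-bins is lower than in the parent bin."""
    return sum(dup_excess(b) for b in sub_bins) < dup_excess(parent)
```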
Phase 3: Near-HQ rescue. Bins estimated at 75–90% completeness are given one opportunity to recover missing SCG markers by recruiting kNN neighbours. A neighbouring contig is accepted into the target bin only if all of its SCG markers are absent from the current bin, ensuring no duplication is introduced.
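The rescue step above can be sketched as a greedy pass over kNN neighbours; requiring a candidate to carry at least one marker is our assumption (the point of the rescue is to recover missing SCGs), as is the function name:

```python
def rescue_bin(bin_markers, neighbours):
    """Greedily recruit kNN neighbour contigs into a near-HQ bin.
    A neighbour is accepted only if all of its SCG markers are absent
    from the bin so far, so completeness can rise but no marker
    duplication is introduced.
    neighbours: list of (contig_id, marker_set), in kNN order.
    Returns (recruited contig ids, updated marker set)."""
    bin_markers = set(bin_markers)
    recruited = []
    for contig_id, markers in neighbours:
        if markers and not (markers & bin_markers):
            recruited.append(contig_id)
            bin_markers |= markers  # its markers now block later clashes
    return recruited, bin_markers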
