# AMBER — Ancient Metagenomic BinnER
AMBER bins metagenomic contigs from ancient DNA (aDNA) samples. Standard binners assume coverage and tetranucleotide frequency are clean signals; in aDNA they are not. AMBER models post-mortem damage and fragment length explicitly, so ancient and modern strains of the same genome end up in the same bin rather than scattered across several.
## The problem: ancient DNA breaks metagenomic binning
Standard binners (MetaBAT2 [1], SemiBin2 [2], COMEBin [3]) rely on two signals: tetranucleotide frequency (genomic composition) and coverage depth (co-abundance across samples). Both are distorted in ancient DNA:
- Damage-induced composition shift. Post-mortem deamination converts cytosines to uracil (read as T) at fragment termini [4]. C→T substitutions accumulate at the 5′ end and G→A at the 3′ end, changing the apparent tetranucleotide composition of every contig. Damaged and undamaged copies of the same genome are pushed apart in composition space.
- Fragment length bias in coverage. Ancient reads are short (median ~40–80 bp) while modern reads are longer (~150–300 bp). Short reads map more ambiguously and cover terminal regions of contigs differently, producing systematically different coverage profiles for ancient and modern genomes even at equal abundance [5].
- Mixed ancient/modern populations. Paleogenomic assemblies often contain reads from both ancient (damaged) and modern (undamaged) DNA, arising from environmental contamination, recent colonising organisms, or in situ DNA turnover. A binner unaware of this mixture splits ancient and modern reads of the same taxon into separate bins, or merges them with the wrong neighbours [6].
AMBER addresses all three: damage-aware embeddings prevent composition distortion from separating contigs of the same genome, and `amber deconvolve` separates ancient from modern reads via Bayesian EM when needed.
## How AMBER works
<p align="center"> <img src="amber_architecture.png" width="900" alt="AMBER pipeline architecture"> </p>

### 1. Feature extraction
For each contig (minimum 1,001 bp by default), AMBER extracts a 157-dimensional feature vector combining encoder output, aDNA-specific damage features, and multi-scale chaos game representation (CGR) features:
| Feature block | Dims | Description |
|---------------|------|-------------|
| Encoder output | 128 | SCG-guided damage-aware InfoNCE contrastive encoder |
| C→T damage profile | 5 | C→T substitution rates at 5′ positions 1–5 |
| G→A damage profile | 5 | G→A substitution rates at 3′ positions 1–5 |
| Decay parameters | 2 | λ₅, λ₃: exponential damage decay constants fitted per contig |
| Fragment length | 2 | Mean and standard deviation of aligned read lengths |
| Damage coverage | 2 | Log-normalised read depth from ancient-classified (p > 0.6) and modern-classified (p < 0.4) reads |
| Mismatch spectrum | 4 | T→C at 5′, C→T at 3′, other mismatches at 5′, other mismatches at 3′ |
| CGR features | 9 | 6 cross-scale slopes (ΔH, ΔO, ΔL at 16→32 and 32→64 grid resolution) + absolute H₃₂, O₃₂, L₃₂ |
aDNA damage features (20 dimensions). The damage profile follows an exponential decay model, with terminal substitution rate at position p from the read end modelled as [7]:
$$\delta_p = d \cdot e^{-\lambda p}$$
where d is the damage amplitude and λ is the per-position decay constant. AMBER fits λ₅ and λ₃ independently for the 5′ and 3′ ends using the observed C→T and G→A rates at positions 1–5. Fragment length mean and standard deviation are computed from the aligned insert size distribution [5].
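The decay fit above reduces to a log-linear regression: taking logarithms of δ_p = d·e^(−λp) gives ln δ_p = ln d − λp, a straight line in p. A minimal sketch of that fit (the helper name `fit_decay` is ours, not AMBER's API):

```python
import numpy as np

def fit_decay(rates, positions=(1, 2, 3, 4, 5)):
    """Fit delta_p = d * exp(-lam * p) to observed terminal substitution
    rates at the first few read positions, via log-linear least squares.
    Returns (d, lam): damage amplitude and per-position decay constant."""
    p = np.asarray(positions, dtype=float)
    r = np.asarray(rates, dtype=float)
    # ln(delta_p) = ln(d) - lam * p, so a degree-1 polyfit on log-rates
    # recovers -lam as the slope and ln(d) as the intercept.
    slope, intercept = np.polyfit(p, np.log(r), 1)
    return float(np.exp(intercept)), float(-slope)

# Synthetic 5' C->T profile with d = 0.3, lam = 0.4
rates = 0.3 * np.exp(-0.4 * np.arange(1, 6))
d, lam = fit_decay(rates)
```

Fitting the 5′ C→T and 3′ G→A profiles separately with this scheme yields the λ₅ and λ₃ entries of the feature vector.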
CGR features (9 dimensions). The chaos game representation [8] maps a nucleotide sequence onto the unit square by iterating toward one of four corner attractors (A, C, G, T). AMBER computes three metrics (Shannon entropy H, occupancy O as fraction of non-empty cells, and lacunarity L as coefficient of variation of cell densities) at grid resolutions 16×16, 32×32, and 64×64. The nine features are the three absolute values at 32×32 plus the six cross-scale slopes (Δmetric between adjacent resolutions), capturing sequence complexity across scales [9].
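The three CGR metrics can be sketched directly from the chaos-game definition above (function names and the N-skipping rule are our assumptions, not AMBER's implementation):

```python
import numpy as np

# Corner attractors of the unit square, one per base.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}

def cgr_points(seq):
    """Chaos game: from the centre, step halfway toward the corner of
    each successive base; return the visited points."""
    x, y = 0.5, 0.5
    pts = []
    for base in seq:
        if base not in CORNERS:
            continue  # skip N and other ambiguity codes (our assumption)
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        pts.append((x, y))
    return np.array(pts)

def cgr_metrics(pts, res):
    """Shannon entropy H, occupancy O (fraction of non-empty cells), and
    lacunarity L (coefficient of variation of cell densities) on a
    res x res grid over the unit square."""
    grid, _, _ = np.histogram2d(pts[:, 0], pts[:, 1],
                                bins=res, range=[[0, 1], [0, 1]])
    p = grid.ravel() / grid.sum()
    nz = p[p > 0]
    H = float(-(nz * np.log2(nz)).sum())
    O = float((grid > 0).mean())
    L = float(grid.std() / grid.mean())
    return H, O, L
```

Evaluating `cgr_metrics` at 16, 32, and 64 and differencing adjacent resolutions gives the six cross-scale slopes; the 32×32 values are kept as the three absolute features.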
The encoder input is a 138-dimensional vector: 136 tetranucleotide frequencies (normalised, reverse-complement collapsed) + 2 coverage features (sqrt-transformed coverage variance and normalised mean depth). The 20 aDNA and 9 CGR dimensions bypass the encoder and are concatenated directly to the 128-dimensional encoder output, forming the 157-dimensional clustering space.
### 2. SCG-guided damage-aware contrastive learning
AMBER trains a contig encoder using InfoNCE [10] (contrastive predictive coding), extended with two aDNA-specific modifications: SCG-guided hard negative mining and damage-aware negative down-weighting.
Base encoder. The architecture follows COMEBin [3]: an MLP (138→2048→2048→2048→128, n_layer=3) with BatchNorm and LeakyReLU activations, and L2-normalised output embeddings. Six augmented views are generated per contig at training time (three coverage subsampling levels × two feature-noise intensities), matching the COMEBin augmentation protocol.
Standard InfoNCE loss [10] for a batch of B contigs with V views each:
$$\mathcal{L}_{\text{InfoNCE}} = -\frac{1}{BV} \sum_{i=1}^{BV} \log \frac{\exp(\text{sim}(z_i, z_{i^+})/\tau)}{\sum_{k \neq i} \exp(\text{sim}(z_i, z_k)/\tau)}$$
where z_i is the L2-normalised embedding of view i, i⁺ indexes the sibling views of the same contig, and τ is the temperature (default 0.1).
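A minimal NumPy rendering of this loss (the training code itself is not shown here; cosine similarity via dot products of L2-normalised embeddings, all non-sibling views as negatives):

```python
import numpy as np

def info_nce(z, contig_ids, tau=0.1):
    """Unweighted InfoNCE over a batch of view embeddings.
    z: (N, d) array of embeddings; contig_ids: length-N labels, where
    views of the same contig share a label and act as positives."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # cosine similarity
    sim = z @ z.T / tau
    np.fill_diagonal(sim, -np.inf)        # a view never pairs with itself
    log_denom = np.logaddexp.reduce(sim, axis=1)  # log sum over all k != i
    ids = np.asarray(contig_ids)
    loss, n_pos = 0.0, 0
    for i in range(len(z)):
        siblings = np.flatnonzero((ids == ids[i]) & (np.arange(len(z)) != i))
        for j in siblings:                # -log( exp(sim_pos) / denom )
            loss += log_denom[i] - sim[i, j]
            n_pos += 1
    return loss / n_pos
```

When sibling views are close and other contigs are far, the loss is near zero; misaligned positives drive it up, which is the gradient signal the encoder trains on.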
Modification 1: SCG hard negatives. Single-copy marker genes (SCGs) are typically present exactly once per genome. Two contigs both containing the same SCG are therefore likely from different genomes, making them high-confidence (though not guaranteed) negative pairs. AMBER amplifies these pairs in the InfoNCE denominator by a factor scg_boost (default 2.0), providing stronger repulsion signal for contigs that compositional embeddings might otherwise conflate [3]:
$$w_{ij}^{\text{SCG}} = \begin{cases} \alpha_{\text{boost}} & \text{if } M(i) \cap M(j) \neq \emptyset \\ 1.0 & \text{otherwise} \end{cases}$$
where M(i) is the set of CheckM marker genes detected on contig i [11] and α_boost = 2.0.
Modification 2: Damage-aware negative down-weighting. Ancient and modern strains of the same taxon carry identical genomic sequence but different damage states. Without correction, the encoder learns to separate them based on C→T and G→A patterns rather than genomic sequence, pushing them into different bins. AMBER down-weights negatives that appear damage-incompatible, preventing damage state from becoming a discriminative feature:
$$w_{ij}^{\text{damage}} = 1 - \lambda_{\text{att}} \cdot c_i \cdot c_j \cdot (1 - f_{\text{compat}}(i, j))$$
where $c_i = n_{\text{eff},i} / (n_{\text{eff},i} + n_0)$ is a read-depth confidence weight (low coverage → low confidence → the weight stays near 1.0 regardless of damage), $f_{\text{compat}}(i, j)$ is a symmetric damage compatibility score combining terminal C→T rates and $p_{\text{ancient}}$ agreement, and $\lambda_{\text{att}}$ controls the attenuation strength.
The combined weight applied to each negative pair is:

$$w_{ij} = \begin{cases} w_{ij}^{\text{SCG}} & \text{if } M(i) \cap M(j) \neq \emptyset \\ w_{ij}^{\text{damage}} & \text{otherwise} \end{cases}$$

so SCG hard negatives are always amplified regardless of damage similarity, while non-SCG pairs are modulated by damage compatibility alone.
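A sketch of that per-pair weighting, following the stated behaviour (SCG pairs amplified outright, everything else damage-modulated); the function name and the `lambda_att` default are our assumptions:

```python
def negative_weight(markers_i, markers_j, c_i, c_j, f_compat,
                    scg_boost=2.0, lambda_att=0.5):
    """InfoNCE negative-pair weight.
    markers_i/j: sets of SCG markers on each contig.
    c_i/j: read-depth confidence weights in [0, 1].
    f_compat: symmetric damage compatibility score in [0, 1]."""
    if markers_i & markers_j:
        # Shared single-copy marker: high-confidence negative, amplify.
        return scg_boost
    # Otherwise attenuate by damage incompatibility, scaled by how
    # confident we are in both contigs' damage estimates.
    return 1.0 - lambda_att * c_i * c_j * (1.0 - f_compat)
```

Note the two failure-safe limits: at zero coverage confidence the weight stays 1.0 (standard InfoNCE), and at full damage compatibility no attenuation is applied.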
Consensus kNN. Three independent encoder restarts (different random seeds) are trained on the same data. Each produces an HNSW approximate nearest-neighbour graph [12] (k = 100, cosine distance, 157-dim feature space). Edge weights from the three graphs are averaged into a single consensus graph before Leiden clustering, reducing sensitivity to training stochasticity.
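The consensus step amounts to averaging edge weights over the restart graphs. A minimal sketch, representing each graph as a `{(u, v): weight}` dict with `u < v`; treating an edge absent from a restart as weight 0 is our assumption (it means only consistently recovered neighbours stay strong):

```python
from collections import defaultdict

def consensus_graph(graphs):
    """Average edge weights across per-restart kNN graphs.
    graphs: list of {(u, v): weight} dicts, one per encoder restart."""
    acc = defaultdict(float)
    for g in graphs:
        for edge, w in g.items():
            acc[edge] += w
    n = len(graphs)  # edges missing from a restart contribute 0
    return {edge: total / n for edge, total in acc.items()}
```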
Marker gene database. SCG detection uses the 206 universal bacterial and archaeal single-copy marker HMM profiles from CheckM [11], bundled as auxiliary/checkm_markers_only.hmm. HMM search is performed with HMMER3 hmmsearch.
### 3. Quality-guided Leiden clustering
AMBER clusters the consensus kNN graph using the Leiden algorithm [13] with a three-phase quality refinement driven by SCG-based completeness and contamination estimates.
Phase 1: SCG-guided Leiden. Before each Leiden run, edges between contigs sharing any SCG marker are penalised:
$$w'_{ij} = w_{ij} \cdot e^{-\gamma \cdot n_{\text{shared}}}$$
with γ = 3.0 (default) and n_shared = |M(i) ∩ M(j)|. This pre-conditions the graph so that contigs from different genomes with similar embeddings are pushed apart before community detection runs.
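The penalty is a single pass over the graph's edges. A sketch (helper name ours), reusing the `{(u, v): weight}` edge-dict convention:

```python
import math

def penalise_scg_edges(edges, markers, gamma=3.0):
    """Down-weight edges between contigs sharing SCG markers:
    w' = w * exp(-gamma * |M(i) ∩ M(j)|).
    edges: {(i, j): weight}; markers: {contig: set of marker ids}."""
    out = {}
    for (i, j), w in edges.items():
        n_shared = len(markers.get(i, set()) & markers.get(j, set()))
        out[(i, j)] = w * math.exp(-gamma * n_shared)
    return out
```

With the default γ = 3.0, a single shared marker already cuts an edge to ~5% of its weight, so such pairs rarely end up in one Leiden community.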
The resolution parameter is swept over [0.5, 5.0] with 25 random Leiden seeds per resolution value. The configuration retained is the one that maximises a tiered quality score: strict-HQ bins (≥90% completeness, <5% contamination [14]) outrank pre-HQ bins (≥72% completeness, <5% contamination), which outrank MQ bins, which outrank raw completeness.
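The tiered score can be expressed as a lexicographically compared tuple, so one extra strict-HQ bin outranks any number of lower-tier gains. A sketch; the MQ thresholds (≥50% completeness, <10% contamination, per MIMAG [14]) and the cumulative counting (a strict-HQ bin also counts toward the lower tiers) are our assumptions:

```python
def tier_score(bins):
    """Tiered quality score for one (resolution, seed) configuration.
    bins: list of (completeness_pct, contamination_pct) per bin.
    Tuples compare lexicographically: strict-HQ count first, then
    pre-HQ, then MQ, then total raw completeness as a tie-breaker."""
    strict_hq = sum(1 for c, x in bins if c >= 90 and x < 5)
    pre_hq = sum(1 for c, x in bins if c >= 72 and x < 5)
    mq = sum(1 for c, x in bins if c >= 50 and x < 10)
    total_completeness = sum(c for c, _ in bins)
    return (strict_hq, pre_hq, mq, total_completeness)
```

The sweep then keeps `max(configurations, key=tier_score)`.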
Phase 2: Contamination splitting. After Phase 1, any bin with excess SCG duplication (dup_excess > 0, meaning more copies of any marker than expected) is re-clustered at 3× the Phase 1 resolution on its contig subgraph. The split is accepted if total dup_excess decreases across the resulting sub-bins.
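The `dup_excess` quantity and the acceptance rule are straightforward to state in code (helper names ours):

```python
from collections import Counter

def dup_excess(contig_markers):
    """Total excess SCG copies in a bin: each single-copy marker is
    expected once, so every additional copy counts toward the excess.
    contig_markers: list of per-contig marker-id lists."""
    counts = Counter(m for contig in contig_markers for m in contig)
    return sum(n - 1 for n in counts.values() if n > 1)

def accept_split(parent, sub_bins):
    """Phase-2 rule: keep the re-clustering only if total duplication
    excess across the sub-bins is lower than in the parent bin."""
    return sum(dup_excess(b) for b in sub_bins) < dup_excess(parent)
```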
Phase 3: Near-HQ rescue. Bins estimated at 75–90% completeness are given one opportunity to recover missing SCG markers by recruiting kNN neighbours. A neighbouring contig is accepted into the target bin only if all of its SCG markers are absent from the current bin, ensuring no duplication is introduced.
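The rescue step above can be sketched as a greedy pass over kNN neighbours; requiring a candidate to carry at least one marker is our assumption (the point of the rescue is to recover missing SCGs), as is the function name:

```python
def rescue_bin(bin_markers, neighbours):
    """Greedily recruit kNN neighbour contigs into a near-HQ bin.
    A neighbour is accepted only if all of its SCG markers are absent
    from the bin so far, so completeness can rise but no marker
    duplication is introduced.
    neighbours: list of (contig_id, marker_set), in kNN order.
    Returns (recruited contig ids, updated marker set)."""
    bin_markers = set(bin_markers)
    recruited = []
    for contig_id, markers in neighbours:
        if markers and not (markers & bin_markers):
            recruited.append(contig_id)
            bin_markers |= markers  # its markers now block later clashes
    return recruited, bin_markers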
