MATES
A Deep Learning-Based Model for Quantifying Transposable Elements in Single-Cell Sequencing Data (Nature Communications, 2024).
Citation
If you use MATES in your research, please cite the MATES publication as follows:
Wang, R., Zheng, Y., Zhang, Z. et al. MATES: a deep learning-based model for locus-specific quantification of transposable elements in single cell. Nat Commun 15, 8798 (2024). https://doi.org/10.1038/s41467-024-53114-7
Overview
<img title="Model Overview" alt="Alt text" src="/figures/Model-figure-01.png">

Transposable elements (TEs) are crucial for genetic diversity and gene regulation. Current single-cell quantification methods often align multi-mapping reads to either ‘best-mapped’ or ‘random-mapped’ locations and categorize them at the subfamily levels, overlooking the biological necessity for accurate, locus-specific TE quantification. Moreover, these existing methods are primarily designed for and focused on transcriptomics data, which restricts their adaptability to single-cell data of other modalities. To address these challenges, here we introduce MATES, a deep-learning approach that accurately allocates multi-mapping reads to specific loci of TEs, utilizing context from adjacent read alignments flanking the TE locus. When applied to diverse single-cell omics datasets, MATES shows improved performance over existing methods, enhancing the accuracy of TE quantification and aiding in the identification of marker TEs for identified cell populations. This development facilitates the exploration of single-cell heterogeneity and gene regulation through the lens of TEs, offering an effective transposon quantification tool for the single-cell genomics community.
<!-- MATES is a specialized tool designed for precise quantification of transposable elements (TEs) in various single-cell datasets. The workflow consists of multiple stages to ensure accurate results. In the initial phase, raw reads are mapped to the reference genome, differentiating between unique-mapping and multi-mapping reads associated with TE loci. Unique-mapping reads create coverage vectors (V<sub>u</sub>), while multi-mapping reads remain associated with V<sub>m</sub> vectors, both capturing read distribution around TEs. TEs are then divided into bins, either unique-dominant (U) or multi-dominant (M), based on read proportion. An autoEncoder model is employed to create latent embeddings (Z<sub>m</sub>) capturing local read context and is combined with TE family information (T<sub>k</sub>). In the subsequent stage, the obtained embeddings are used to jointly estimate the multi-mapping ratio (α<sub>i</sub>) via a multilayer perceptron. Training the model involves a global loss (L<sub>1</sub> and L<sub>2</sub>) comprising reconstruction loss and read coverage continuity. Trained to predict multi-mapping ratios, the model counts reads in TE regions, enabling probabilistic TE quantification at the single-cell level. MATES enhances cell clustering and biomarker identification by integrating TE quantification with gene expression methods. -->
<!-- With the burgeoning field of single-cell sequencing data, the potential for in-depth TE quantification and analysis is enormous, opening avenues to gain invaluable insights into the molecular mechanisms underpinning various human diseases. MATES furnishes a powerful tool for accurately quantifying and investigating TEs at specific loci and single-cell level, thereby significantly enriching our understanding of complex biological processes. This opens a new dimension for genomics and cell biology research and holds promise for potential therapeutic breakthroughs. -->
Release Notes
- Version 0.1.8: Support TE quantification for data without sufficient multi-mapping TE reads.
- Version 0.1.7: Parallelize preprocessing for 10X-format data.
- Version 0.1.6: Add a simple mode for MATES to quantify TE within 3 lines of code. Add a common errors Q&A.
- Version 0.1.5: Improve the efficiency of splitting BAM files and counting TE reads.
MATES is actively under development; please feel free to reach out if you encounter any issue.
Installation
Installing MATES
To install MATES, run the following commands:
# Clone the MATES repository
git clone https://github.com/mcgilldinglab/MATES.git
# Create a new environment
conda create -n mates_env python=3.9
conda activate mates_env
# Install required packages
conda install -c bioconda samtools -y
conda install -c bioconda bedtools -y
# Install MATES
cd MATES
pip install .
# Add environment to Jupyter Notebook
conda install ipykernel
python -m ipykernel install --user --name=mates_env
Installation should take only a few minutes. Verify installation:
import MATES
Usage
Simple mode
Use the all-in-one MATES_pipeline. Please read our Examples and APIs for details.
from MATES import MATES_pipeline
mates = MATES_pipeline(TE_mode, data_mode, sample_list_file, bam_path_file) #set up parameters
mates.preprocessing() #Preprocessing
mates.run() #train model and quantify both subfamily and locus-level TE expression
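The two file arguments can be prepared programmatically. The sketch below assumes, based on the parameter descriptions in the APIs section, that sample_list_file holds one sample ID per line and bam_path_file holds the matching BAM path on the same line number. The helper names write_mates_inputs and run_simple_mode, the output file names, and the example sample IDs and paths are hypothetical and not part of MATES.

```python
from pathlib import Path


def write_mates_inputs(sample_to_bam, out_dir="."):
    """Write the two text inputs used by the simple mode.

    Assumption: one sample ID per line in the sample list, and the matching
    BAM path on the same line of the BAM path file.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    sample_list_file = out_dir / "sample_list.txt"  # hypothetical file name
    bam_path_file = out_dir / "bam_path.txt"        # hypothetical file name
    # Dicts keep insertion order, so both files stay aligned line by line.
    sample_list_file.write_text("".join(f"{s}\n" for s in sample_to_bam))
    bam_path_file.write_text("".join(f"{p}\n" for p in sample_to_bam.values()))
    return str(sample_list_file), str(bam_path_file)


def run_simple_mode(TE_mode, data_mode, sample_list_file, bam_path_file):
    """Run the three simple-mode calls shown above (requires MATES installed)."""
    from MATES import MATES_pipeline  # imported lazily so the helper above works without MATES
    mates = MATES_pipeline(TE_mode, data_mode, sample_list_file, bam_path_file)
    mates.preprocessing()
    mates.run()
```

With Smart_seq-style data (one BAM file per cell), a call might look like `run_simple_mode('exclusive', 'Smart_seq', *write_mates_inputs({'cell_1': '/data/cell_1.bam'}))`, where the sample ID and path are placeholders.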
Advanced mode
Run the pipeline step by step: count coverage vectors, determine U/M regions, generate training and prediction data, train models, quantify subfamily-level TEs, and quantify locus-level TEs. Please read our Examples and tutorials.
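To give an intuition for the U/M region step, here is a toy illustration and not the MATES implementation. It follows the model description, in which coverage around a TE is split into bins that are unique-dominant (U) or multi-dominant (M) according to the proportion of unique-mapping and multi-mapping reads. The function name label_bins, the bin size, the 0.5 threshold, and the "-" label for empty bins are all assumptions made for this sketch.

```python
def label_bins(v_u, v_m, bin_size=5, threshold=0.5):
    """Label each bin of a TE coverage vector as 'U', 'M', or '-' (no reads).

    v_u / v_m: per-position coverage from unique-mapping / multi-mapping reads.
    bin_size and threshold are illustrative values, not MATES defaults.
    """
    if len(v_u) != len(v_m):
        raise ValueError("v_u and v_m must have the same length")
    labels = []
    # Trailing positions that do not fill a whole bin are ignored.
    for start in range(0, len(v_u) - bin_size + 1, bin_size):
        u = sum(v_u[start:start + bin_size])
        m = sum(v_m[start:start + bin_size])
        if u + m == 0:
            labels.append("-")   # no coverage in this bin
        elif u / (u + m) >= threshold:
            labels.append("U")   # unique-mapping reads dominate
        else:
            labels.append("M")   # multi-mapping reads dominate
    return labels


v_u = [4, 4, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
v_m = [0, 0, 0, 0, 1, 3, 3, 3, 3, 3, 0, 0, 0, 0, 0]
print(label_bins(v_u, v_m))  # ['U', 'M', '-']
```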
Common Q&A
If you encounter errors when using MATES, please read our common Q&A.
Tutorials
Customize the reference genome for the species of interest
We support automatic creation of the human and mouse TE/Gene reference genomes using python build_reference.py --species Human/Mouse. For Arabidopsis thaliana and Drosophila melanogaster, please visit the shared folder for the GTF file, the RepeatMasker file, and an example script to create their TE/Gene reference genome. For other species, please refer to the tutorial on building the TE and Gene reference genome.
Walkthrough Example
From loading data to downstream analysis. Please refer to the Example section for details.
Example scripts for different types of single-cell data
- MATES pipeline on Smart-seq2 scRNA and 10X scRNA/scATAC/Multi-Omics data
- MATES pipeline on 10X Visium data
- MATES pipeline on Long Reads data
10x scRNA-seq dataset
Smart-seq2 scRNA dataset
- MATES downstream analysis on Smart-seq2 scRNA data (TE only)
- MATES downstream analysis on Smart-seq2 scRNA data (Gene+TE)
10x scATAC-seq dataset
APIs
MATES contains six modules.
import MATES
from MATES import bam_processor
from MATES import data_processor
from MATES import MATES_model
from MATES import TE_quantifier
from MATES import TE_quantifier_LongRead
from MATES import TE_quantifier_Intronic
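A quick way to confirm that the package and its six modules are importable in the active environment is sketched below. The helper name find_missing_modules is hypothetical. It uses only importlib from the standard library and the module names listed above.

```python
import importlib

# The package itself plus the six modules listed above.
MATES_MODULES = [
    "MATES",
    "MATES.bam_processor",
    "MATES.data_processor",
    "MATES.MATES_model",
    "MATES.TE_quantifier",
    "MATES.TE_quantifier_LongRead",
    "MATES.TE_quantifier_Intronic",
]


def find_missing_modules(names):
    """Return the names that cannot be imported in the current environment."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = find_missing_modules(MATES_MODULES)
    if missing:
        print("Not importable:", ", ".join(missing))
    else:
        print("MATES and all six modules imported successfully.")
```

If any name is reported as not importable, check that the mates_env environment is active and that pip install . completed without errors.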
- bam_processor: The bam_processor module manages input BAM files by partitioning them into sub-BAM files for individual cells and distinguishing unique-mapping from multi-mapping reads. It also constructs TE-specific coverage vectors that capture read distributions around TE instances at the single-cell level, which supports accurate TE quantification and cellular characterization.
For simplicity, in data_mode, we use 10X to represent the format of data in which each BAM file contains reads from multiple cells, and Smart_seq to represent the type of data where individual BAM files contain reads from only one cell.
bam_processor.split_count_10X_data(TE_mode, sample_list_file, bam_path_file, bc_path_file, bc_ind='CB', ref_path = 'Default',num_threads=1)
# Parameters
## TE_mode : <str> exclusive or inclusive, whether to remove TE instances that overlap with genes (for intronic, refer to the section below)
## sample_list_file : <str> path to a file that contains the sample IDs
## bam_path_file : <str> path to a file that contains the BAM file path for each sample in the sample list
## bc_path_file(optional) : <str> path to a file that contains the barcode list path for each sample in the sample list
## bc_ind : <str> barcode field indicator in the BAM files, e.g. CB/CR...
## ref_path(optional) : <str> TE reference BED file. Only needed for a self-generated reference, in which case provide the path to the reference. By default, exclusive mode uses the reference 'TE_nooverlap.bed' and inclusive mode uses the reference 'TE_full.bed'.
## num_threads(optional) : <int> The number of processes. By default it is 1. Increasing the number of threads will reduce the running time, b
