MATES

A Deep Learning-Based Model for Quantifying Transposable Elements in Single-Cell Sequencing Data (Nature Communications, 2024).

Citation

If you use MATES in your research, please cite the MATES publication as follows:

Wang, R., Zheng, Y., Zhang, Z. et al. MATES: a deep learning-based model for locus-specific quantification of transposable elements in single cell. Nat Commun 15, 8798 (2024). https://doi.org/10.1038/s41467-024-53114-7

Overview

Transposable elements (TEs) are crucial for genetic diversity and gene regulation. Current single-cell quantification methods often align multi-mapping reads to either ‘best-mapped’ or ‘random-mapped’ locations and categorize them at the subfamily level, overlooking the biological necessity for accurate, locus-specific TE quantification. Moreover, these existing methods are primarily designed for transcriptomics data, which restricts their adaptability to single-cell data of other modalities. To address these challenges, here we introduce MATES, a deep-learning approach that accurately allocates multi-mapping reads to specific TE loci, utilizing context from adjacent read alignments flanking the TE locus. When applied to diverse single-cell omics datasets, MATES shows improved performance over existing methods, enhancing the accuracy of TE quantification and aiding in the identification of marker TEs for identified cell populations. This development facilitates the exploration of single-cell heterogeneity and gene regulation through the lens of TEs, offering an effective transposon quantification tool for the single-cell genomics community.

Release Notes

Version 0.1.8: Support TE quantification for data without sufficient multi-mapping TE reads.
Version 0.1.7: Parallelize preprocessing for 10X-format data.
Version 0.1.6: Add a simple mode for MATES to quantify TEs within three lines of code. Add a Q&A for common errors.
Version 0.1.5: Improve the efficiency of splitting BAM files and counting TE reads.

MATES is actively under development; please feel free to reach out if you encounter any issue.

Installation

Installing MATES

To install MATES, run the following commands:

# Clone the MATES repository
git clone https://github.com/mcgilldinglab/MATES.git

# Create a new environment
conda create -n mates_env python=3.9
conda activate mates_env

# Install required packages
conda install -c bioconda samtools -y
conda install -c bioconda bedtools -y

# Install MATES
cd MATES
pip install .

# Add environment to Jupyter Notebook
conda install ipykernel
python -m ipykernel install --user --name=mates_env

Installation should take only a few minutes. Verify installation:

import MATES
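
Because MATES depends on the samtools and bedtools packages installed above, it can also help to confirm that both are visible on the PATH of the active environment. The following is a minimal sketch using only the Python standard library; the helper name check_tools is hypothetical and not part of MATES.

```python
import shutil

def check_tools(tools=("samtools", "bedtools")):
    """Map each tool name to its resolved path, or None if it is not on PATH.

    Hypothetical helper for illustration; not part of the MATES package.
    """
    return {tool: shutil.which(tool) for tool in tools}

# Report which external tools can be found in the current environment.
for tool, path in check_tools().items():
    print(f"{tool}: {path or 'NOT FOUND'}")
```

If either tool is reported as NOT FOUND, activating mates_env and rerunning the conda install commands above is the first thing to try.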

Usage

Simple mode

Use the all-in-one MATES_pipeline. Please read our Examples and APIs for details.

from MATES import MATES_pipeline
mates = MATES_pipeline(TE_mode, data_mode, sample_list_file, bam_path_file) #set up parameters
mates.preprocessing() #Preprocessing
mates.run() #train model and quantify both subfamily and locus-level TE expression
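
The pipeline takes two text files as input: sample_list_file, holding the sample IDs, and bam_path_file, holding the matching BAM file locations. Below is a minimal sketch for writing them, assuming one entry per line with both files in the same order. That layout is an assumption here, so please check the Examples for the authoritative format. The helper write_mates_inputs, the sample IDs, and the BAM paths are all hypothetical.

```python
import tempfile
from pathlib import Path

def write_mates_inputs(samples, out_dir="."):
    """Write sample IDs and BAM paths to two line-aligned text files.

    `samples` maps sample ID -> BAM path. Line i of the BAM path file
    corresponds to line i of the sample list (assumed layout).
    Hypothetical helper; not part of the MATES package.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    sample_list_file = out_dir / "sample_list.txt"
    bam_path_file = out_dir / "bam_path.txt"
    sample_list_file.write_text("".join(f"{sid}\n" for sid in samples))
    bam_path_file.write_text("".join(f"{bam}\n" for bam in samples.values()))
    return str(sample_list_file), str(bam_path_file)

# Hypothetical sample IDs and BAM locations, written to a temporary folder.
sample_list_file, bam_path_file = write_mates_inputs(
    {
        "sample1": "/data/sample1/possorted_genome_bam.bam",
        "sample2": "/data/sample2/possorted_genome_bam.bam",
    },
    out_dir=tempfile.mkdtemp(),
)
print(sample_list_file, bam_path_file)
```

The two returned paths can then be passed as the sample_list_file and bam_path_file arguments shown above.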

Advanced mode

Run each step individually: count coverage vectors, determine U/M regions, generate training and prediction data, train models, quantify subfamily-level TEs, and quantify locus-level TEs. Please read our Examples and tutorials.

Common Q&A

If you encounter errors when using MATES, please read our common Q&A.

Tutorials

Customize the reference genome for the species of interest

MATES supports automatic creation of the human and mouse TE/gene reference genomes using python build_reference.py --species Human/Mouse. For Arabidopsis thaliana and Drosophila melanogaster, please visit the shared folder for the GTF file, the RepeatMasker file, and an example script to create their TE/gene reference genomes. For other species, please refer to the tutorial on building the TE and gene reference genome.

Walkthrough Example

From loading data to downstream analysis. Please refer to the Examples section for details.

Example scripts for different types of single-cell data

10x scRNA-seq dataset

MATES downstream analysis on 10X scRNA data

Smart-seq2 scRNA dataset

10x scATAC-seq dataset

MATES downstream analysis on 10X scATAC data

APIs

MATES contains six modules.

import MATES
from MATES import bam_processor
from MATES import data_processor
from MATES import MATES_model
from MATES import TE_quantifier
from MATES import TE_quantifier_LongRead
from MATES import TE_quantifier_Intronic

bam_processor

The bam_processor module efficiently manages input BAM files by partitioning them into sub-BAM files for individual cells, distinguishing unique-mapping from multi-mapping reads. It also constructs TE-specific coverage vectors, shedding light on read distributions around TE instances at the single-cell level, enabling accurate TE quantification and comprehensive cellular characterization.
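
To make the idea of a TE-specific coverage vector concrete, the toy sketch below counts per-base read depth across one TE interval, separately for unique-mapping and multi-mapping reads. It is an illustration under stated assumptions, not the bam_processor implementation: reads are plain (start, end, is_multi) tuples with half-open coordinates rather than BAM records, and the function name coverage_vectors is hypothetical.

```python
def coverage_vectors(te_start, te_end, reads):
    """Return (unique_cov, multi_cov) per-base depth over [te_start, te_end).

    `reads` is an iterable of (start, end, is_multi) tuples with half-open
    coordinates on the same chromosome as the TE (assumed toy format).
    """
    length = te_end - te_start
    unique_cov = [0] * length
    multi_cov = [0] * length
    for start, end, is_multi in reads:
        # Clip the read to the TE interval; skip reads that do not overlap.
        lo = max(start, te_start) - te_start
        hi = min(end, te_end) - te_start
        if lo >= hi:
            continue
        target = multi_cov if is_multi else unique_cov
        for i in range(lo, hi):
            target[i] += 1
    return unique_cov, multi_cov

# Toy TE at [100, 110) with two unique-mapping and two multi-mapping reads.
reads = [(95, 105, False), (100, 104, True), (108, 120, True), (200, 210, False)]
unique_cov, multi_cov = coverage_vectors(100, 110, reads)
print(unique_cov)  # [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
print(multi_cov)   # [1, 1, 1, 1, 0, 0, 0, 0, 1, 1]
```

Keeping the two vectors separate shows which positions are supported by unique-mapping reads and which only by multi-mapping reads.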

For simplicity, in data_mode, 10X denotes data in which each BAM file contains reads from multiple cells, and Smart_seq denotes data in which each BAM file contains reads from only one cell.
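
The distinction matters because a 10X-format BAM file has to be split by cell barcode before per-cell counting, whereas Smart_seq files already hold one cell each. The toy sketch below shows the grouping idea on plain dictionaries standing in for alignment records. It does not read BAM files and is not how MATES works internally; the record layout and the name split_by_barcode are hypothetical, and CB is used as the barcode tag to match the bc_ind default.

```python
from collections import defaultdict

def split_by_barcode(records, bc_ind="CB", allowed=None):
    """Group read names per cell using the barcode stored under `bc_ind`.

    Records without the tag, or with a barcode outside `allowed`
    (when a barcode whitelist is given), are dropped.
    """
    per_cell = defaultdict(list)
    for rec in records:
        barcode = rec.get("tags", {}).get(bc_ind)
        if barcode is None:
            continue
        if allowed is not None and barcode not in allowed:
            continue
        per_cell[barcode].append(rec["name"])
    return dict(per_cell)

# Toy records standing in for alignments from a multi-cell BAM file.
records = [
    {"name": "read1", "tags": {"CB": "AAAC-1"}},
    {"name": "read2", "tags": {"CB": "TTTG-1"}},
    {"name": "read3", "tags": {"CB": "AAAC-1"}},
    {"name": "read4", "tags": {}},  # no barcode tag: dropped
]
print(split_by_barcode(records))
# {'AAAC-1': ['read1', 'read3'], 'TTTG-1': ['read2']}
```

Passing a set of barcodes as allowed mimics restricting the split to a known list of cells.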

bam_processor.split_count_10X_data(TE_mode, sample_list_file, bam_path_file, bc_path_file, bc_ind='CB', ref_path='Default', num_threads=1)
# Parameters
## TE_mode : <str> exclusive or inclusive; whether to remove TE instances that overlap with genes (for intronic, refer to the section below)
## sample_list_file : <str> path to the file that contains the sample IDs
## bam_path_file : <str> path to the file that contains the BAM file address matching each sample in the sample list
## bc_path_file (optional) : <str> path to the file that contains the barcode list address matching each sample in the sample list
## bc_ind : <str> barcode field indicator in BAM files, e.g. CB/CR...
## ref_path (optional) : <str> TE reference BED file. Only needed for a self-generated reference; provide the path to the reference. By default, exclusive uses the reference 'TE_nooverlap.bed' and inclusive uses 'TE_full.bed'.
## num_threads (optional) : <int> number of processes. The default is 1. Increasing the number of threads will reduce the running time, b
