HDMC

HDMC (Hierarchical Distribution Matching and Contrastive learning) is a novel deep learning based framework for batch effect removal in scRNA-seq data.

Generate Convert Improve

Install / Use

/learn @zhanglabNKU/HDMC

About this skill

Quality Score

0/100

README

HDMC: Hierarchical Distribution Matching and Contrastive learning

Code and data for using HDMC, a novel deep learning based framework for batch effect removal in scRNA-seq data.

Install

git clone https://github.com/zhanglabNKU/HDMC.git
cd HDMC/

R Dependencies

Seurat 2.3.0

Python Dependencies

Python 3.7.7
scikit-learn 0.23.2
pytorch 1.3.1
imbalanced-learn 0.7.0
rpy2 2.9.4
universal-divergence 0.2.0
pandas 1.0.4

Usage

Given several datasets (each treated as a batch) for combination, there are two main steps: (i) preprocess the datasets and run metaneighbor algorithm to compute cluster similarities; (ii) train an HDMC model for batch correction.

Data preprocessing

Run the R script pre_processing.R as follows:

Rscript pre_processing.R folder_name file1 file2 ...

For example:

Rscript pre_processing.R example batch1.csv batch2.csv

The two datasets batch1.csv and batch2.csv (must be csv form) will be processed by the script and you will get three files saved in the same folder: the processed data named batch1_seurat.csv and batch2_seurat.csv, a file named metaneighbor.csv containing values of the cluster similarities between different batches.

Batch correction

Run the python script hdmc.py to combine the datasets and remove batch effects as follows:

python hdmc.py -data_folder folder -files file1 file2 ... -h_thr thr1 -l_thr thr2

For example:

python hdmc.py -data_folder example/ -files batch1_seurat.csv batch2_seurat.csv -h_thr 0.9 -l_thr 0.5

This command will train an HDMC model for the selected files in the data_folder with two thresholds (-h_thr is the higher threshold and -l_thr is the lower one). When the training is finished, the datasets will be combined without batch effectes and the result file named combined.csv will be saved in the same data folder.

In addition, some optional parameters are also available:

-num_epochs: number of the training epochs (default=2000)
-code_dim: dimension of the embedded code (default=20)
-base_lr: base learning rate for network training (default=1e-3)
-lr_step: step decay of learning rates (default=200)
-gamma: hyperparameter for adversarial learning (default=1)

Under most circumstances, you don't need to change the optional parameters.

Use the help command to print all the options:

python hdmc.py --help

Data availability

The download links of all the datasets are given in the folder named data.

Related Skills

YC-Killer

2.7k

A library of enterprise-grade AI agents designed to democratize artificial intelligence and provide free, open-source alternatives to overvalued Y Combinator startups. If you are excited about democratizing AI access & AI agents, please star ⭐️ this repository and use the link in the readme to join our open source AI research team.

groundhog

398

Groundhog's primary purpose is to teach people how Cursor and all these other coding agents work under the hood. If you understand how these coding assistants work from first principles, then you can drive these tools harder (or perhaps make your own!).

sec-edgar-agentkit

AI agent toolkit for accessing and analyzing SEC EDGAR filing data. Build intelligent agents with LangChain, MCP-use, Gradio, Dify, and smolagents to analyze financial statements, insider trading, and company filings.

Kiln

4.7k

Build, Evaluate, and Optimize AI Systems. Includes evals, RAG, agents, fine-tuning, synthetic data generation, dataset management, MCP, and more.