HDMC
HDMC (Hierarchical Distribution Matching and Contrastive learning) is a deep-learning-based framework for removing batch effects from single-cell RNA sequencing (scRNA-seq) data.
HDMC: Hierarchical Distribution Matching and Contrastive learning
Code and data for HDMC, a deep-learning-based framework for removing batch effects from scRNA-seq data.
Install
git clone https://github.com/zhanglabNKU/HDMC.git
cd HDMC/
R Dependencies
- Seurat 2.3.0
Python Dependencies
- Python 3.7.7
- scikit-learn 0.23.2
- pytorch 1.3.1
- imbalanced-learn 0.7.0
- rpy2 2.9.4
- universal-divergence 0.2.0
- pandas 1.0.4
Usage
Given several datasets to combine (each treated as one batch), there are two main steps: (i) preprocess the datasets and run the MetaNeighbor algorithm to compute cluster similarities; (ii) train an HDMC model for batch correction.
Data preprocessing
Run the R script pre_processing.R as follows:
Rscript pre_processing.R folder_name file1 file2 ...
For example:
Rscript pre_processing.R example batch1.csv batch2.csv
The script processes the two datasets batch1.csv and batch2.csv (both must be in CSV format) and saves three files in the same folder: the processed data, batch1_seurat.csv and batch2_seurat.csv, and metaneighbor.csv, which contains the cluster similarity values between the batches.
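Before moving on to batch correction, it can help to confirm that preprocessing wrote the files described above. The following is a minimal sketch, not part of the HDMC repository: the helper names expected_outputs and missing_outputs are hypothetical, and the output naming rule (input name plus _seurat.csv, and a single metaneighbor.csv) is taken only from the example above.

```python
from pathlib import Path


def expected_outputs(folder, batch_files):
    """List the files pre_processing.R is described as writing.

    Assumption: each input <name>.csv yields <name>_seurat.csv, and one
    metaneighbor.csv is written to the same folder.
    """
    folder = Path(folder)
    processed = [folder / f"{Path(name).stem}_seurat.csv" for name in batch_files]
    return processed + [folder / "metaneighbor.csv"]


def missing_outputs(folder, batch_files):
    """Return the expected output files that do not exist on disk."""
    return [path for path in expected_outputs(folder, batch_files) if not path.exists()]


if __name__ == "__main__":
    missing = missing_outputs("example", ["batch1.csv", "batch2.csv"])
    if missing:
        print("Missing preprocessing outputs:", ", ".join(str(p) for p in missing))
    else:
        print("All preprocessing outputs found.")
```

If any file is reported missing, rerun the Rscript command before starting training.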
Batch correction
Run the Python script hdmc.py to combine the datasets and remove batch effects as follows:
python hdmc.py -data_folder folder -files file1 file2 ... -h_thr thr1 -l_thr thr2
For example:
python hdmc.py -data_folder example/ -files batch1_seurat.csv batch2_seurat.csv -h_thr 0.9 -l_thr 0.5
This command trains an HDMC model on the selected files in the data folder using two thresholds (-h_thr is the higher threshold and -l_thr is the lower one). When training finishes, the datasets are combined with batch effects removed, and the result is saved as combined.csv in the same data folder.
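When running several corrections in a row, the command above can be assembled from Python. This is a rough sketch under stated assumptions: build_hdmc_command and run_hdmc are hypothetical helpers that are not part of HDMC, the flags are exactly those shown in the example, and the check that -l_thr is below -h_thr is an added sanity check based on the description of the two thresholds, not something the script is known to enforce.

```python
import subprocess
import sys
from pathlib import Path


def build_hdmc_command(data_folder, files, h_thr, l_thr, python=sys.executable):
    """Assemble the hdmc.py command line shown in the README example."""
    if not l_thr < h_thr:
        # Added sanity check (assumption): the lower threshold should be smaller.
        raise ValueError("l_thr must be smaller than h_thr")
    # data_folder is passed through unchanged so a trailing slash is preserved.
    return [
        python, "hdmc.py",
        "-data_folder", str(data_folder),
        "-files", *files,
        "-h_thr", str(h_thr),
        "-l_thr", str(l_thr),
    ]


def run_hdmc(data_folder, files, h_thr, l_thr):
    """Run hdmc.py from the repository root and return the expected result path."""
    subprocess.run(build_hdmc_command(data_folder, files, h_thr, l_thr), check=True)
    return Path(data_folder) / "combined.csv"
```

For the example above, run_hdmc("example/", ["batch1_seurat.csv", "batch2_seurat.csv"], 0.9, 0.5) would launch training and return the path example/combined.csv.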
In addition, some optional parameters are also available:
- -num_epochs: number of training epochs (default=2000)
- -code_dim: dimension of the embedded code (default=20)
- -base_lr: base learning rate for network training (default=1e-3)
- -lr_step: step decay of learning rates (default=200)
- -gamma: hyperparameter for adversarial learning (default=1)
Under most circumstances, you don't need to change the optional parameters.
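To give a feel for how -base_lr and -lr_step might interact, here is a small sketch of a generic step-decay schedule. It is an illustration only and not taken from hdmc.py: it assumes -lr_step is the number of epochs between decays, and the multiplicative factor (decay=0.5) is a hypothetical value, since the README does not state one. Note that the README's -gamma is a separate adversarial-learning hyperparameter and is not used here.

```python
def step_decay_lr(epoch, base_lr=1e-3, lr_step=200, decay=0.5):
    """Learning rate at a given epoch under a generic step-decay schedule.

    Assumptions: the rate is multiplied by `decay` once every `lr_step`
    epochs; `decay` is a hypothetical factor, not an HDMC option.
    """
    return base_lr * decay ** (epoch // lr_step)


if __name__ == "__main__":
    # Print the rate at the start of the first few decay intervals.
    for epoch in (0, 200, 400, 600):
        print(epoch, step_decay_lr(epoch))
```

Under these assumptions the rate stays at base_lr for the first lr_step epochs and drops in discrete steps afterwards, so lowering -lr_step makes the rate shrink sooner.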
Use the help command to print all the options:
python hdmc.py --help
Data availability
The download links of all the datasets are given in the folder named data.
