DeepGenerativeModelLINCS

Deep Generative Models for Learning Gene Expression Profile Latent Representations from LINCS L1000 data

Generate Convert Improve

Install / Use

/learn @evasnow1992/DeepGenerativeModelLINCS

About this skill

Quality Score

0/100

README

Deep Generative Models on LINCS Data

Python code for the manuscript "Learning to Encode Cellular Responses to Systematic Perturbations with Deep Generative Models"

This repository provides Python code for preprocessing LINCS L1000 gene expression data and applying two deep generative models, Variational AutoEncoder (VAE) and Supervised Vector-Quantized Variational AutoEncoder (S-VQ-VAE) on the data for learning latent representations. VAE and S-VQ-VAE are implemented with Pytorch.

Required environment and packages

The code was tested on the following packages

numpy 1.16.2
matplotlib 3.0.3
pandas 0.23.4
cmapPy (for loading the LINCS datasets)
torch 0.4.1
sklearn 0.21.3
scipy 1.3.1
seaborn 0.9.0

Due to the requirement of cmapPy, the script should be run with Python2.7 to avoid compatibility issue.

Data

The LINCS data are available from the Gene Expression Omnibus (GEO) with accession codes GSE70138 and GSE70138 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE70138; https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE70138)

The perturbagen class (PCL) information of small molecule perturbagens can be found in the paper "A next generation connectivity map: L1000 platform and the first 1,000,000 profiles" Supplementary Table S7.

Files

deep_generative_model_LINCS.ipynb

Code for preprocessing LINCS L1000 data and training VAE and S-VQ-VAE on three datasets:

SM dataset (SMP dataset in the manuscript): a subset of GEO dataset GSE70138, which contained the level 5 expression data (moderate z-scores) of the 978 landmark genes of 118,050 small-molecule-perturbed samples from 7 cell lines.

GP dataset: a subset of GEO dataset GSE106127, which contained the level 5 data of 119,013 gene-knocked-down samples from 9 cell lines.

Both dataset (SMGP dataset in the manuscript): a merge of the SM dataset and GP dataset while excluding 4,649 samples perturbed by two proteasome inhibitors, bortezomib, and MG-132.

deep_generative_model_LINCS_analysis.ipynb

Code for analyzing the latent representations of expression profiles learned with VAEs and S-VQ-VAEs. Analyses include:

Identify signature nodes from the top hidden layer of SMGP-trained VAE model encoder.
Compare the distribution of data generated with VAEs with real data.
Generate data to simulate expression profiles perturbed with small molecules from a given perturbagen class.
Classify PCL based on different latent representations of expression profiles.
Predict drug-targets with latent representations of expression profiles.
Reveal correlations between PCL global representations learned with S-VQ-VAE.

drug_gene_target_prediction_handle.ipynb

Concise code extracted from deep_generative_model_LINCS_analysis.ipynb for drug gene target prediction using VAE generated representations. Users may modify section 2.2 to search for drugs of interest. The known gene targets of each drug are required to be provided. The top10 ranks and mean rank of the top-ranked target gene across all drug perturbed samples are reported.

VAE_encode_SMP.pth, VAE_decode_SMP.pth, VAE_mu_SMP.pth, and VAE_logvar_SMP.pth

Pretrained VAE model on SMP dataset.

VAE_encode_GP.pth, VAE_decode_GP.pth, VAE_mu_GP.pth, and VAE_logvar_GP.pth

Pretrained VAE model on GP dataset.

VAE_encode_SMGP.pth, VAE_decode_SMGP.pth, VAE_mu_SMGP.pth, and VAE_logvar_SMGP.pth

Pretrained VAE model on SMGP dataset.

S_VQ_VAE_decode.pth, S_VQ_VAE_embedding.pth, and S_VQ_VAE_encode.pth

Pretrained S-VQ-VAE models for learning global representations for PCLs.

A tutorial of how to apply S-VQ-VAE on MNIST dataset is available at https://github.com/evasnow1992/S-VQ-VAE.

Related Skills

proje

Interactive vocabulary learning platform with smart flashcards and spaced repetition for effective language acquisition.

YC-Killer

2.7k

A library of enterprise-grade AI agents designed to democratize artificial intelligence and provide free, open-source alternatives to overvalued Y Combinator startups. If you are excited about democratizing AI access & AI agents, please star ⭐️ this repository and use the link in the readme to join our open source AI research team.

best-practices-researcher

The most comprehensive Claude Code skills registry | Web Search: https://skills-registry-web.vercel.app

groundhog

400

Groundhog's primary purpose is to teach people how Cursor and all these other coding agents work under the hood. If you understand how these coding assistants work from first principles, then you can drive these tools harder (or perhaps make your own!).

evasnow1992

View profile

View on GitHub

GitHub Stars9

CategoryEducation

Updated1y ago

Forks6

evasnow1992/DeepGenerativeModelLINCS

Languages

Jupyter Notebook

Security Score

70/100

Audited on Mar 7, 2025

No findings