Orchid
A novel management, annotation, and machine learning framework for analyzing cancer mutations
Installation and usage instructions can be found in the wiki.
<img src="images/orchid.png" alt="Orchid" height=150px; align="right">Orchid
A framework for cancer variant annotation, classification, and analysis <br/>
Introduction
Please refer to the following publication for a detailed description of the software:
Bioinformatics, btx709, https://doi.org/10.1093/bioinformatics/btx709
What is Orchid?
The objective of Orchid is to facilitate meaningful biological and clinical interpretation of tumor genetic data through the use of machine learning. For example, Orchid could be used to classify aggressive vs. non-aggressive prostate cancer or determine the tissue-of-origin from the cell-free DNA molecules of a patient with cancer.
<br />What is a 'tumor mutational profile'?
In the Orchid framework, we define a tumor mutational profile as the annotated set of mutations within a tumor. A typical tumor might contain thousands of mutations. Most are presumed to be irrelevant to disease because they arise from an important hallmark of cancer: an unstable genome. However, a crucial subset of these mutations is considered fundamental to carcinogenesis, or at least significantly involved, making them potential biomarkers for clinical classification (e.g. tumor aggressiveness). Orchid adopts a comprehensive approach to variant analysis, employing machine learning algorithms to collectively analyze all mutations. This methodology exposes nuanced mutational patterns and helps tease apart biological complexity.
<br />What is an 'annotated set of mutations'?
Annotations are numeric or categorical values that are associated with a particular mutation. For example, mutation 'A' may change the amino acid sequence of a protein, so we can annotate it with one category of amino acid consequence: a 'non-synonymous single nucleotide polymorphism' or 'nsSNP'. On the other hand, mutation 'B' may change a codon, but not the corresponding amino acid, so we would annotate it with another amino acid consequence category: a 'synonymous SNP'. Biologically speaking, an nsSNP is more likely to change the effect of a protein than a synonymous SNP. In the machine learning world, annotations like these are called features. If we gather many mutations across a tumor (or tumors) and annotate each mutation with many features, we end up with a set of annotated mutations, which we call a tumor mutational profile.
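To make the idea of annotations as features concrete, here is a minimal sketch in plain Python. It is not Orchid's own code: the `encode` helper, the `conservation` annotation, and all values are hypothetical, chosen only to show how a categorical annotation (the amino acid consequence) and a numeric one might become a row of numbers per mutation.

```python
# Hypothetical mutations and annotation values, invented for illustration only.
mutations = [
    {"id": "A", "consequence": "nsSNP", "conservation": 0.92},
    {"id": "B", "consequence": "synonymous SNP", "conservation": 0.15},
]


def encode(mutations, categorical_key, numeric_keys):
    """Turn annotated mutations into numeric feature rows.

    The categorical annotation is one-hot encoded (one column per category);
    numeric annotations are copied through unchanged.
    """
    categories = sorted({m[categorical_key] for m in mutations})
    header = [f"{categorical_key}={c}" for c in categories] + list(numeric_keys)
    rows = []
    for m in mutations:
        one_hot = [1.0 if m[categorical_key] == c else 0.0 for c in categories]
        rows.append(one_hot + [float(m[k]) for k in numeric_keys])
    return header, rows


# One row per mutation, one column per feature.
header, profile = encode(mutations, "consequence", ["conservation"])
```

Each row of `profile` corresponds to one mutation and each column to one feature, which is the same layout as the mutational profile described in this README.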
To date, many regulatory and coding features of the human genome have been cataloged. If we gather enough biological data to annotate mutations found in a tumor genome, we may be able to understand the mutational process in cancer. For development and publication of this software, we used a number of public biological databases (see here; Note: This page is now archived). In practice, any annotation source can be used.
Here's an example of a mutational profile:

Mutations are arranged in rows and corresponding feature values in columns. The values here are normalized and colored white to orange (low to high). There is also a final column of sample labels, which is ultimately used for training and validation. NOTE: You may notice a lot of correlated feature vectors. Before training an ML model, it's important to reduce feature correlation as much as possible!
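One simple way to reduce feature correlation is a greedy filter based on the Pearson coefficient. The sketch below is an illustration only, not the method Orchid uses: the function names, the feature names, the values, and the 0.9 threshold are all assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant column: treated as uncorrelated here
    return cov / math.sqrt(var_x * var_y)


def drop_correlated(columns, threshold=0.9):
    """Keep a feature only if |r| with every already-kept feature is <= threshold."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical feature columns (one value per mutation), invented for illustration.
features = {
    "feature_1": [0.1, 0.4, 0.5, 0.9],
    "feature_2": [0.2, 0.8, 1.0, 1.8],  # exactly 2 * feature_1, so fully redundant
    "feature_3": [0.9, 0.1, 0.7, 0.3],
}
kept = drop_correlated(features)
```

Here `feature_2` is dropped because it is a scaled copy of `feature_1`, while `feature_3` is retained. The result depends on column order, since the first feature of a correlated group is the one kept; other strategies such as dimensionality reduction are equally valid choices.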
Getting Started
- Download this code and install prerequisites
- Obtain tumor and annotation data
- Build the database
- Perform machine learning
Please refer to the wiki to begin!
NOTICE: This software requires the use of other code and/or data that must be obtained in accordance with their respective licenses or copyrights. Generally speaking, this implies Orchid's use is restricted to non-commercial activities. Orchid itself is licensed under the MIT license, requiring only preservation of copyright and license notices. Please see the LICENSE file for more details.
