BETA2
TF Binding and Expression Target Analysis 2
BETA2: Binding and Expression Target Analysis
BETA is a computational tool for integrative analysis of ChIP-seq and RNA-seq/microarray data to predict transcription factor (TF) direct target genes and identify whether the TF primarily functions as a transcriptional activator or repressor.
The Biological Problem
When you perform ChIP-seq to find where a transcription factor binds and RNA-seq to see which genes change expression, a critical question arises: Which genes are direct targets of your factor versus indirect/secondary effects?
Several challenges complicate this analysis:
- No 1-to-1 mapping: A single binding site can regulate multiple genes, and a gene can be regulated by multiple binding sites
- Not all binding is functional: Some ChIP-seq peaks may not actually regulate nearby genes due to lack of cofactors or unfavorable chromatin environment
- Secondary effects: Binding to direct target genes causes them to change expression, which then affects other genes downstream (indirect targets)
What BETA Does
BETA addresses these challenges by integrating binding and expression data to answer three key questions:
- Is your factor an activator or repressor?
  - Determines whether the factor primarily activates or represses gene expression by testing if genes with stronger binding potential are enriched among upregulated or downregulated genes
- Which genes are direct targets?
  - Identifies genes that are most likely to be directly regulated by combining two lines of evidence: proximity/strength of binding AND expression changes
  - Genes with both high binding potential and differential expression are prioritized as direct targets
- What cofactors modulate the factor's function? (optional)
  - Identifies DNA-binding motifs enriched near your factor's binding sites to discover collaborating transcription factors
How BETA Works
Regulatory Potential Model: Instead of simply assigning the nearest gene to each peak, BETA calculates a "regulatory potential score" for each gene based on ALL nearby binding sites within a distance window (default 100kb). Binding sites closer to the transcription start site (TSS) contribute more to the score using an exponentially decaying distance function - this reflects the biological reality that closer regulatory elements generally have stronger effects.
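To make the distance weighting concrete, here is a minimal Python sketch of a regulatory potential score. It assumes the decay form exp(-(0.5 + 4Δ)) described for the original BETA, where Δ is the peak-to-TSS distance divided by the window size; the constants, the function name regulatory_potential, and the use of peak centers are assumptions for illustration and may not match BETA2's implementation (see METHODOLOGY.md for the authoritative formula).

```python
import math

def regulatory_potential(tss, peak_centers, window=100_000):
    """Sum exponentially decaying contributions of peaks near a TSS.

    tss: TSS coordinate (bp); peak_centers: peak midpoints on the same chromosome.
    The constants 0.5 and 4.0 are assumed, not taken from BETA2's source.
    """
    score = 0.0
    for center in peak_centers:
        distance = abs(center - tss)
        if distance > window:
            continue  # peaks outside the window contribute nothing
        delta = distance / window  # distance normalized to [0, 1]
        score += math.exp(-(0.5 + 4.0 * delta))
    return score

# A peak 1 kb from the TSS outweighs a peak 50 kb away.
near = regulatory_potential(0, [1_000])
far = regulatory_potential(0, [50_000])
```

Because contributions are summed, a gene with several moderately distant peaks can outscore a gene with a single close peak, which is the behavior nearest-gene assignment cannot capture.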
Rank Product Integration: BETA ranks genes by two criteria:
- Regulatory potential score (how much binding is nearby)
- Differential expression significance (how much expression changed)
The rank product identifies genes that score well on BOTH criteria - these are the most confident direct targets. Genes that only show binding OR only show expression changes are deprioritized, reducing false positives from non-functional binding sites and indirect targets.
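The rank product described above can be sketched as follows. This is an illustrative Python version, not BETA's code: the function name rank_product, the normalization by the number of genes, and the tie handling (ties keep sorted gene order) are assumptions.

```python
def rank_product(binding_scores, expression_pvalues):
    """Return {gene: rank product} for genes present in both inputs.

    binding_scores: {gene: regulatory potential}, higher is better.
    expression_pvalues: {gene: p-value or FDR}, lower is better.
    """
    genes = sorted(set(binding_scores) & set(expression_pvalues))
    n = len(genes)
    by_binding = sorted(genes, key=lambda g: -binding_scores[g])
    by_expression = sorted(genes, key=lambda g: expression_pvalues[g])
    binding_rank = {g: i + 1 for i, g in enumerate(by_binding)}
    expression_rank = {g: i + 1 for i, g in enumerate(by_expression)}
    # A small product requires a good rank on BOTH criteria.
    return {g: (binding_rank[g] / n) * (expression_rank[g] / n) for g in genes}

# Hypothetical genes: A scores well on both, B only on expression, C only on binding.
scores = {"A": 2.1, "B": 0.3, "C": 1.5}
pvals = {"A": 0.001, "B": 0.0005, "C": 0.2}
rp = rank_product(scores, pvals)
best = min(rp, key=rp.get)  # gene with the smallest rank product
```

In this toy input, gene B has the most significant expression change but the weakest binding, so it ranks behind gene A, which is strong on both.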
Statistical Testing: BETA uses the Kolmogorov-Smirnov test to determine if upregulated or downregulated genes have significantly higher regulatory potential scores than non-differentially expressed genes, revealing whether your factor functions as an activator, repressor, or both.
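As a rough illustration of the test, the sketch below computes a one-sided two-sample Kolmogorov-Smirnov statistic in plain Python. The function name ks_one_sided and the asymptotic p-value approximation are assumptions made for this example; the approximation is loose for small samples, and BETA's own routine may differ.

```python
import math

def ks_one_sided(sample, background):
    """One-sided two-sample KS: is `sample` shifted toward larger values?

    Returns (D, approximate p-value). D is the largest amount by which the
    background ECDF exceeds the sample ECDF.
    """
    n, m = len(sample), len(background)
    xs, ys = sorted(sample), sorted(background)
    d = 0.0
    i = j = 0
    for value in sorted(set(xs) | set(ys)):
        while i < n and xs[i] <= value:
            i += 1
        while j < m and ys[j] <= value:
            j += 1
        d = max(d, j / m - i / n)  # background ECDF minus sample ECDF
    # Asymptotic tail of the one-sided statistic (approximation).
    p = math.exp(-2.0 * d * d * n * m / (n + m))
    return d, min(1.0, p)

# Hypothetical regulatory potential scores.
up_genes = [0.8, 0.9, 1.2, 1.5]
static_genes = [0.1, 0.2, 0.3, 0.4, 0.5]
d_up, p_up = ks_one_sided(up_genes, static_genes)
```

Running the same test separately for upregulated and downregulated genes against the unchanged genes indicates whether the factor behaves as an activator, a repressor, or both. In practice a library routine such as scipy.stats.ks_2samp would normally replace this hand-written version.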
For Technical Details: See METHODOLOGY.md for a comprehensive step-by-step explanation of all calculations, formulas, and statistical tests with worked examples.
Key Features
- Integrative Analysis: Combines ChIP-seq peaks with gene expression data
- Regulatory Potential Scoring: Distance-weighted scoring system
- Statistical Assessment: Kolmogorov-Smirnov test and permutation-based FDR
- Motif Analysis: Optional motif scanning and enrichment analysis
- Multiple Input Formats: Supports LIMMA, Cuffdiff, and custom formats
- Genome Support: Human (hg38, hg19, hg18) and Mouse (mm10, mm9)
Installation
Requirements
- Python 3.8 or higher
- C compiler (gcc) for motif scanning module
From PyPI (Recommended)
pip install beta-binding-analysis
From Source
git clone https://github.com/crazyhottommy/BETA2.git
cd BETA2
pip install -e .
Quick Start
Basic Analysis
Predict TF target genes and function (activator/repressor):
# Note: diff_expr.txt should be TF activation/stimulation vs control (not knockdown)
beta basic \
-p peaks.bed \
-e diff_expr.txt \
-k LIM \
-g hg38 \
-n my_experiment \
-o output_dir
Plus Mode (with Motif Analysis)
Include motif analysis:
beta plus \
-p peaks.bed \
-e diff_expr.txt \
-k LIM \
-g hg38 \
--gs hg38.fa \
-n my_experiment \
-o output_dir
Minus Mode (Peaks Only)
Analyze binding data without expression data:
beta minus \
-p peaks.bed \
-g hg38 \
-n my_experiment \
-o output_dir
Example with Test Data
Using the included test data for estrogen receptor (ER/ESR1):
beta basic \
-p BETA_test_data/ER_349_peaks.bed \
-e BETA_test_data/ESR1_diff_expr.xls \
-k O \
--info 1,2,6 \
-g hg19 \
-n ESR1
Note: When using custom format (-k O), the --info parameter specifies column positions: gene ID, log fold change, and statistical value (e.g., 1,2,6 for columns 1, 2, and 6). See Input Files section below for format details.
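To illustrate how the 1-based --info column numbers map onto a tab-delimited file, here is a small Python sketch. The helper parse_custom_expression is hypothetical and only mimics the described behavior (skip the #-prefixed header, pick three columns); it is not BETA's parser.

```python
def parse_custom_expression(lines, info="1,2,6"):
    """Yield (gene_id, log_fold_change, stat_value) from tab-delimited lines.

    `info` holds three 1-based column numbers, mirroring the --info argument.
    """
    id_col, fc_col, stat_col = (int(c) - 1 for c in info.split(","))
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and the '#'-prefixed header
        fields = line.split("\t")
        yield fields[id_col], float(fields[fc_col]), float(fields[stat_col])

example = [
    "#ID\tlogFC\tAveExpr\tt\tP.Value\tadj.P.Val\n",
    "NM_001002231\t3.21\t9.17\t35.33\t8.07e-11\t4.18e-07\n",
]
rows = list(parse_custom_expression(example))
```

With --info 1,2,6, the gene ID comes from column 1, the log fold change from column 2, and the adjusted p-value from column 6; the intermediate columns are ignored.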
Input Files
ChIP-seq Peaks (required)
BED format file (3 or 5 columns):
chr1 1000 2000
chr1 5000 6000
Peak Number Cutoff - IMPORTANT:
By default, BETA only uses the top 10,000 peaks even if your BED file contains more peaks. This is controlled by the --pn parameter.
Why this default exists:
- Focus on high-confidence peaks (strongest peaks are most likely functional)
- Reduce computational time
- Reduce noise from weak/low-confidence binding events
When to adjust:
- Use default (10,000): For most analyses, or if you have noisy ChIP-seq data
- Increase the number: If you have high-quality ChIP-seq with many strong peaks and want comprehensive analysis
  beta basic -p peaks.bed -e expr.txt -k LIM -g hg38 -n my_TF --pn 34000
- Use all peaks: Set to a very high number to ensure all peaks are included
  beta basic -p peaks.bed -e expr.txt -k LIM -g hg38 -n my_TF --pn 100000
- Decrease: If you want to focus only on the very strongest binding events
  beta basic -p peaks.bed -e expr.txt -k LIM -g hg38 -n my_TF --pn 5000
Example: If your BED file contains 34,000 peaks but you don't specify --pn, only the top 10,000 peaks (by score, if available) will be used in the analysis.
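If you want to check in advance which peaks a given cutoff would keep, the selection can be approximated with a short Python sketch. The helper top_peaks is hypothetical; it assumes the score sits in column 5 of a 5-column BED file and that 3-column input is kept in file order, which may not match how BETA handles unscored peaks.

```python
def top_peaks(bed_lines, peak_number=10_000):
    """Keep the `peak_number` highest-scoring peaks from BED lines.

    Uses column 5 as the score when present; 3-column input has no score,
    so the first `peak_number` peaks are kept in file order (an assumption).
    """
    peaks = []
    for line in bed_lines:
        fields = line.split()
        if len(fields) < 3 or line.startswith(("#", "track", "browser")):
            continue  # skip comments, track lines, and malformed rows
        score = float(fields[4]) if len(fields) >= 5 else None
        peaks.append((fields[0], int(fields[1]), int(fields[2]), score))
    if peaks and all(p[3] is not None for p in peaks):
        # Stable sort: equally scored peaks keep their file order.
        peaks.sort(key=lambda p: p[3], reverse=True)
    return peaks[:peak_number]

# Hypothetical scored peaks; only the two strongest survive a cutoff of 2.
bed = [
    "chr1\t1000\t2000\tp1\t5.0",
    "chr1\t5000\t6000\tp2\t9.0",
    "chr2\t10\t20\tp3\t7.0",
]
kept = top_peaks(bed, peak_number=2)
```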
Differential Expression (required for basic/plus modes)
Experimental Design - IMPORTANT:
The differential expression file should represent gene expression changes from activating/stimulating your transcription factor:
- Preferred: TF activation/overexpression/stimulation vs control
  - Example: AR activation (androgen treatment) vs vehicle control
  - Example: ESR1 activation (estrogen treatment) vs vehicle control
  - Example: TF overexpression vs empty vector control
- If you only have knockdown/knockout/inhibition data: You can use it by flipping the sign of log2FC values
  - If you have: TF knockdown vs control
  - Simply multiply all log2FC values by -1
  - Example: Gene X has log2FC = -2.5 in knockdown → use log2FC = +2.5 for BETA
  - Keep p-values and FDR unchanged (only flip the fold change sign)
  - Biological interpretation: If knocking down the TF decreases a gene's expression, that gene is likely activated by the TF (so activating the TF would increase it)
Why this matters: BETA determines whether your TF is an activator or repressor by testing if genes with high binding potential are enriched among upregulated or downregulated genes. This logic requires that you provide the "TF activation" comparison. If your factor is an activator, activating it will increase expression of direct targets (positive logFC). If it's a repressor, activating it will decrease expression of direct targets (negative logFC).
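For knockdown or knockout data, the sign flip can be done with a few lines of Python before running BETA. This is a sketch under assumptions: the helper flip_log_fold_change is hypothetical, the input is tab-delimited, and the fold change is in column 2 (adjust fc_column for your file).

```python
def flip_log_fold_change(lines, fc_column=2):
    """Return lines with the sign of the fold-change column reversed.

    `fc_column` is 1-based. Header lines starting with '#' and blank lines
    pass through unchanged; p-value and FDR columns are not touched.
    """
    flipped = []
    for line in lines:
        stripped = line.rstrip("\n")
        if not stripped or stripped.startswith("#"):
            flipped.append(stripped)
            continue
        fields = stripped.split("\t")
        # Subtract from 0.0 so a zero fold change stays "0.0", not "-0.0".
        fields[fc_column - 1] = str(0.0 - float(fields[fc_column - 1]))
        flipped.append("\t".join(fields))
    return flipped

# Hypothetical knockdown-vs-control rows (the FDR value is illustrative).
knockdown = ["#ID\tlog2FC\tFDR", "GeneX\t-2.5\t0.001"]
activation_like = flip_log_fold_change(knockdown)
```

Gene X, which dropped by 2.5 in the knockdown, is written out with log2FC 2.5, so BETA treats it as a gene that rises when the factor is active.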
Gene ID Format:
BETA supports two types of gene identifiers in differential expression files:
- RefSeq IDs (default): Use transcript/gene accessions like NM_001002231, NR_045762, XM_012345
  - This is the default behavior - no additional flag needed
  - Example: NM_001002231, NR_045762_at
- Official Gene Symbols: Use gene names like TP53, MYC, BRCA1
  - Requires the --gname2 flag
  - Example command:
    beta basic -p peaks.bed -e expression.txt -k LIM --gname2 -g hg38 -n experiment
Important: All genes in your differential expression file must use the SAME identifier type (either all RefSeq IDs or all gene symbols). Do not mix them.
Supported formats:
- LIMMA (-k LIM): Standard LIMMA output
- Cuffdiff (-k CUF): Cuffdiff gene_exp.diff format
- BETA Standard Format (-k BSF):
  GeneSymbol log2FoldChange FDR
  TP53 2.5 0.001
  MYC -1.8 0.01
- Other/Custom (-k O): Custom tab-delimited format with --info to specify columns
  IMPORTANT: The first line must start with # to be treated as a header. Without the #, BETA will try to parse the header as data and fail.
  Example with RefSeq IDs (columns 1, 2, 6 contain gene ID, logFC, adj.P.Val):
  #ID logFC AveExpr t P.Value adj.P.Val
  NM_001002231 3.21 9.17 35.33 8.07e-11 4.18e-07
  NM_005551 2.14 8.45 28.15
