BETA2: Binding and Expression Target Analysis

BETA is a computational tool for integrative analysis of ChIP-seq and RNA-seq/microarray data to predict transcription factor (TF) direct target genes and identify whether the TF primarily functions as a transcriptional activator or repressor.

The Biological Problem

When you perform ChIP-seq to find where a transcription factor binds and RNA-seq to see which genes change expression, a critical question arises: which genes are direct targets of your factor, and which changes are indirect (secondary) effects?

Several challenges complicate this analysis:

No 1-to-1 mapping: A single binding site can regulate multiple genes, and a gene can be regulated by multiple binding sites
Not all binding is functional: Some ChIP-seq peaks may not actually regulate nearby genes due to lack of cofactors or unfavorable chromatin environment
Secondary effects: Binding to direct target genes causes them to change expression, which then affects other genes downstream (indirect targets)

What BETA Does

BETA addresses these challenges by integrating binding and expression data to answer three key questions:

1. Is your factor an activator or repressor?
   - Determines whether the factor primarily activates or represses gene expression by testing if genes with stronger binding potential are enriched among upregulated or downregulated genes
2. Which genes are direct targets?
   - Identifies genes that are most likely to be directly regulated by combining two lines of evidence: proximity/strength of binding AND expression changes
   - Genes with both high binding potential and differential expression are prioritized as direct targets
3. What cofactors modulate the factor's function? (optional)
   - Identifies DNA-binding motifs enriched near your factor's binding sites to discover collaborating transcription factors

How BETA Works

Regulatory Potential Model: Instead of simply assigning the nearest gene to each peak, BETA calculates a "regulatory potential score" for each gene from ALL binding sites within a distance window (default 100 kb). Each site's contribution decays exponentially with its distance from the transcription start site (TSS), so closer sites count for more. This reflects the biological observation that closer regulatory elements generally have stronger effects.
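
As a rough illustration of the distance-weighted scoring, the Python sketch below sums an exponentially decaying weight over every peak inside the window. The weight `exp(-(0.5 + 4 * delta))`, with `delta` the peak-to-TSS distance divided by the window size, is assumed here from the original BETA publication; the function and argument names are hypothetical and this is not BETA's own code.

```python
import math

def regulatory_potential(tss, peak_centers, window=100_000):
    """Sum decaying weights for all peaks within `window` bp of the TSS.

    Assumed weight per peak: exp(-(0.5 + 4 * delta)),
    where delta = distance / window (so delta is in [0, 1]).
    """
    score = 0.0
    for center in peak_centers:
        distance = abs(center - tss)
        if distance <= window:          # peaks outside the window contribute nothing
            delta = distance / window
            score += math.exp(-(0.5 + 4 * delta))
    return score

# Two nearby peaks outweigh a single distant one.
near = regulatory_potential(tss=50_000, peak_centers=[48_000, 55_000])
far = regulatory_potential(tss=50_000, peak_centers=[140_000])
print(near > far)
```

Because the score is a sum, a gene with several moderately distant peaks can outrank a gene with one peak right at its TSS, which is the behavior a nearest-gene assignment cannot capture.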

Rank Product Integration: BETA ranks genes by two criteria:

Regulatory potential score (how much binding is nearby)
Differential expression significance (how much expression changed)

The rank product identifies genes that score well on BOTH criteria; these are the most confident direct targets. Genes that only show binding OR only show expression changes are deprioritized, reducing false positives from non-functional binding sites and indirect targets.
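
The idea can be sketched in a few lines of Python. This is a minimal illustration, assuming the rank product is the product of the two ranks, each divided by the number of genes (smaller is better); tie handling and the helper names are simplifications of my own, not BETA's implementation.

```python
def rank_descending(values):
    """Return ranks where 1 = largest value; ties keep input order."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks

def rank_product(genes, potential, significance):
    """Combine binding and expression ranks; a smaller product is a stronger candidate.

    potential:    regulatory potential score per gene (higher = more binding)
    significance: expression significance per gene, e.g. -log10(FDR) (higher = stronger)
    """
    n = len(genes)
    binding_ranks = rank_descending(potential)
    expression_ranks = rank_descending(significance)
    return {
        gene: (rb / n) * (re_ / n)
        for gene, rb, re_ in zip(genes, binding_ranks, expression_ranks)
    }

scores = rank_product(
    genes=["A", "B", "C"],
    potential=[2.0, 0.1, 1.0],      # A has the most binding nearby
    significance=[5.0, 4.0, 0.2],   # A also changes most significantly
)
best = min(scores, key=scores.get)
print(best)
```

Gene A ranks first on both criteria, so it gets the smallest product; B (expression only) and C (binding only) are both pushed down.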

Statistical Testing: BETA uses the Kolmogorov-Smirnov test to determine if upregulated or downregulated genes have significantly higher regulatory potential scores than non-differentially expressed genes, revealing whether your factor functions as an activator, repressor, or both.
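
To make the comparison concrete, the sketch below computes a one-sided two-sample Kolmogorov-Smirnov statistic in plain Python: the largest amount by which the background's empirical CDF exceeds the gene group's, which is large when the group's scores are shifted upward. Treating the test as one-sided is an assumption on my part, the function names are hypothetical, and no p-value is computed (a library routine such as `scipy.stats.ks_2samp` would provide one).

```python
from bisect import bisect_right

def ecdf(sorted_sample, x):
    """Fraction of the sample that is <= x."""
    return bisect_right(sorted_sample, x) / len(sorted_sample)

def ks_shift_statistic(group, background):
    """One-sided KS statistic: max over x of F_background(x) - F_group(x).

    Values near 1 mean the group's scores sit well above the background's;
    0 means no evidence of an upward shift.
    """
    g = sorted(group)
    b = sorted(background)
    # The supremum of a difference of step functions is reached at a data point.
    return max(ecdf(b, x) - ecdf(g, x) for x in g + b)

up_genes = [0.9, 1.2, 1.5, 2.0]       # regulatory potential of upregulated genes
static_genes = [0.1, 0.2, 0.4, 1.0]   # regulatory potential of unchanged genes
print(ks_shift_statistic(up_genes, static_genes))
```

Running the same comparison for upregulated and downregulated genes separately indicates which group, if either, carries the higher binding potential.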

For Technical Details: See METHODOLOGY.md for a comprehensive step-by-step explanation of all calculations, formulas, and statistical tests with worked examples.

Key Features

Integrative Analysis: Combines ChIP-seq peaks with gene expression data
Regulatory Potential Scoring: Distance-weighted scoring system
Statistical Assessment: Kolmogorov-Smirnov test and permutation-based FDR
Motif Analysis: Optional motif scanning and enrichment analysis
Multiple Input Formats: Supports LIMMA, Cuffdiff, and custom formats
Genome Support: Human (hg38, hg19, hg18) and Mouse (mm10, mm9)

Installation

Requirements

Python 3.8 or higher
C compiler (gcc) for motif scanning module

From PyPI (Recommended)

pip install beta-binding-analysis

From Source

git clone https://github.com/crazyhottommy/BETA2.git
cd BETA2
pip install -e .

Quick Start

Basic Analysis

Predict TF target genes and function (activator/repressor):

# Note: diff_expr.txt should be TF activation/stimulation vs control (not knockdown)
beta basic \
  -p peaks.bed \
  -e diff_expr.txt \
  -k LIM \
  -g hg38 \
  -n my_experiment \
  -o output_dir

Plus Mode (with Motif Analysis)

Include motif analysis:

beta plus \
  -p peaks.bed \
  -e diff_expr.txt \
  -k LIM \
  -g hg38 \
  --gs hg38.fa \
  -n my_experiment \
  -o output_dir

Minus Mode (Peaks Only)

Analyze binding data without expression data:

beta minus \
  -p peaks.bed \
  -g hg38 \
  -n my_experiment \
  -o output_dir

Example with Test Data

Using the included test data for estrogen receptor (ER/ESR1):

beta basic \
  -p BETA_test_data/ER_349_peaks.bed \
  -e BETA_test_data/ESR1_diff_expr.xls \
  -k O \
  --info 1,2,6 \
  -g hg19 \
  -n ESR1

Note: When using custom format (-k O), the --info parameter specifies column positions: gene ID, log fold change, and statistical value (e.g., 1,2,6 for columns 1, 2, and 6). See Input Files section below for format details.

Input Files

ChIP-seq Peaks (required)

BED format file (3 or 5 columns):

chr1    1000    2000
chr1    5000    6000

Peak Number Cutoff - IMPORTANT:

By default, BETA only uses the top 10,000 peaks even if your BED file contains more peaks. This is controlled by the --pn parameter.

Why this default exists:

Focus on high-confidence peaks (strongest peaks are most likely functional)
Reduce computational time
Reduce noise from weak/low-confidence binding events

When to adjust:

- Use default (10,000): For most analyses, or if you have noisy ChIP-seq data
- Increase the number: If you have high-quality ChIP-seq with many strong peaks and want comprehensive analysis
```
beta basic -p peaks.bed -e expr.txt -k LIM -g hg38 -n my_TF --pn 34000
```

- Use all peaks: Set --pn to a value at least as large as the number of peaks in your BED file
```
beta basic -p peaks.bed -e expr.txt -k LIM -g hg38 -n my_TF --pn 100000
```

- Decrease: If you want to focus only on the very strongest binding events
```
beta basic -p peaks.bed -e expr.txt -k LIM -g hg38 -n my_TF --pn 5000
```

Example: If your BED file contains 34,000 peaks but you don't specify --pn, only the top 10,000 peaks (by score, if available) will be used in the analysis.
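
The effect of the cutoff can be pictured with a short Python sketch. It assumes a 5-column BED record (chrom, start, end, name, score) with the score in the fifth column; the function name is hypothetical, and how BETA orders peaks when no score column is present is not covered here.

```python
def top_peaks(peaks, n=10_000):
    """Keep the n highest-scoring peaks.

    peaks: list of (chrom, start, end, name, score) tuples
    """
    return sorted(peaks, key=lambda peak: peak[4], reverse=True)[:n]

peaks = [
    ("chr1", 1000, 2000, "peak_1", 5.0),
    ("chr1", 5000, 6000, "peak_2", 50.0),
    ("chr2", 100, 200, "peak_3", 20.0),
]
strongest = top_peaks(peaks, n=2)
print([peak[3] for peak in strongest])
```

With `n=2` the weakest peak (`peak_1`) is dropped, just as peaks ranked below the `--pn` value would be excluded from the analysis.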

Differential Expression (required for basic/plus modes)

Experimental Design - IMPORTANT:

The differential expression file should represent gene expression changes from activating/stimulating your transcription factor:

Preferred: TF activation/overexpression/stimulation vs control
- Example: AR activation (androgen treatment) vs vehicle control
- Example: ESR1 activation (estrogen treatment) vs vehicle control
- Example: TF overexpression vs empty vector control
If you only have knockdown/knockout/inhibition data: You can use it by flipping the sign of log2FC values
- If you have: TF knockdown vs control
- Simply multiply all log2FC values by -1
- Example: Gene X has log2FC = -2.5 in knockdown → use log2FC = +2.5 for BETA
- Keep p-values and FDR unchanged (only flip the fold change sign)
- Biological interpretation: If knocking down the TF decreases a gene's expression, that gene is likely activated by the TF (so activating the TF would increase it)
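
A minimal Python sketch of the sign flip, assuming each row is a list of fields with log2FC in the second position (the function name and column index are illustrative choices, not part of BETA):

```python
def flip_log2fc(rows, lfc_index=1):
    """Negate the log2FC field of every row; all other fields are left untouched."""
    flipped = []
    for row in rows:
        new_row = list(row)
        new_row[lfc_index] = -float(row[lfc_index])
        flipped.append(new_row)
    return flipped

knockdown = [
    ["GeneX", "-2.5", "0.001"],   # down in knockdown, so likely activated by the TF
    ["GeneY", "1.2", "0.03"],     # up in knockdown, so likely repressed by the TF
]
activation_like = flip_log2fc(knockdown)
print(activation_like)
```

Only the fold-change column changes; the p-value/FDR column passes through unchanged, matching the guidance above.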

Why this matters: BETA determines whether your TF is an activator or repressor by testing if genes with high binding potential are enriched among upregulated or downregulated genes. This logic requires that you provide the "TF activation" comparison. If your factor is an activator, activating it will increase expression of direct targets (positive logFC). If it's a repressor, activating it will decrease expression of direct targets (negative logFC).

Gene ID Format:

BETA supports two types of gene identifiers in differential expression files:

RefSeq IDs (default): Use transcript/gene accessions like NM_001002231, NR_045762, XM_012345
- This is the default behavior - no additional flag needed
- Example: NM_001002231, NR_045762_at
Official Gene Symbols: Use gene names like TP53, MYC, BRCA1
- Requires the --gname2 flag
- Example command: beta basic -p peaks.bed -e expression.txt -k LIM --gname2 -g hg38 -n experiment

Important: All genes in your differential expression file must use the SAME identifier type (either all RefSeq IDs or all gene symbols). Do not mix them.
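
Before running BETA, a quick pre-check can catch mixed identifiers. The Python sketch below is a hypothetical helper, not a BETA feature: it assumes RefSeq accessions start with `NM_`, `NR_`, `XM_`, or `XR_` followed by digits and treats everything else as a gene symbol.

```python
import re

# Assumed RefSeq transcript accession prefixes; anything else counts as a symbol.
REFSEQ_PATTERN = re.compile(r"^(NM|NR|XM|XR)_\d+")

def id_types(gene_ids):
    """Return the set of identifier types found: 'refseq' and/or 'symbol'."""
    return {
        "refseq" if REFSEQ_PATTERN.match(gene_id) else "symbol"
        for gene_id in gene_ids
    }

found = id_types(["NM_001002231", "TP53", "NR_045762"])
if len(found) > 1:
    print("Mixed identifier types:", sorted(found))
```

If the check reports a single type, `symbol` means the run needs `--gname2`, while `refseq` matches the default behavior.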

Supported formats:

LIMMA (-k LIM): Standard LIMMA output
Cuffdiff (-k CUF): Cuffdiff gene_exp.diff format

BETA Standard Format (-k BSF):

GeneSymbol    log2FoldChange    FDR
TP53          2.5               0.001
MYC           -1.8              0.01

Other/Custom (-k O): Custom tab-delimited format with --info to specify columns

IMPORTANT: The first line must start with # to be treated as a header. Without the #, BETA will try to parse the header as data and fail.

Example with RefSeq IDs (columns 1, 2, 6 contain gene ID, logFC, adj.P.Val):
```
#ID              logFC    AveExpr    t           P.Value      adj.P.Val
NM_001002231     3.21     9.17       35.33       8.07e-11     4.18e-07
NM_005551        2.14     8.45       28.15   
```
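
To show why the `#` matters, here is a small Python sketch of a reader for the custom format. It assumes tab-delimited input and 1-based column positions as given to `--info`; the function name is hypothetical and BETA's own parser and error messages may differ.

```python
def parse_custom(lines, info=(1, 2, 6)):
    """Extract (gene_id, logFC, statistic) using 1-based column positions.

    Lines starting with '#' are treated as headers and skipped.
    """
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        gene_id, logfc, stat = (fields[i - 1] for i in info)
        # float() raises ValueError on header text such as "logFC"
        records.append((gene_id, float(logfc), float(stat)))
    return records

lines = [
    "#ID\tlogFC\tAveExpr\tt\tP.Value\tadj.P.Val",
    "NM_001002231\t3.21\t9.17\t35.33\t8.07e-11\t4.18e-07",
]
print(parse_custom(lines))
```

Dropping the leading `#` from the header line makes the reader try to convert `logFC` to a number, which fails; this mirrors the failure described above.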
