CTCF
Genomic coordinates of CTCF binding sites predicted by FIMO using JASPAR and other PWMs for human and mouse genome assemblies, including mm39 and T2T. Also included are experimentally derived CTCF-bound cis-regulatory elements (CREs) from ENCODE SCREEN.
CTCF defines an AnnotationHub resource representing genomic coordinates of FIMO-predicted CTCF binding sites for human and mouse genomes, including the Telomere-to-Telomere and mm39 genome assemblies. It also includes experimentally defined CTCF-bound cis-regulatory elements from ENCODE SCREEN.
TL;DR: For the human hg38 genome assembly, use hg38.MA0139.1.RData
("AH104729"). For the mouse mm10 genome assembly, use mm10.MA0139.1.RData
("AH104755"). For ENCODE SCREEN data, use the
hg38.SCREEN.GRCh38_CTCF.RData ("AH104730") or
mm10.SCREEN.mm10_CTCF.RData ("AH104756") objects.
The CTCF GRanges objects are named <assembly>.<Database>. The
FIMO-predicted data include extra columns with the motif name, score,
p-value, q-value, and the motif sequence.
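As a small illustration of the naming convention, the sketch below splits record titles into their assembly and database parts using base R only. The helper `parse_ctcf_title()` is hypothetical and not part of the CTCF package; it assumes every title follows the `<assembly>.<Database>.RData` pattern, with the assembly being the text before the first dot.

```r
# Hypothetical helper (not part of the CTCF package): split a record title
# of the form "<assembly>.<Database>.RData" into its two components.
parse_ctcf_title <- function(title) {
  stem <- sub("\\.RData$", "", title)      # drop the file extension
  assembly <- sub("\\..*$", "", stem)      # text before the first dot
  database <- sub("^[^.]*\\.", "", stem)   # text after the first dot
  data.frame(title = title, assembly = assembly, database = database,
             stringsAsFactors = FALSE)
}

# Titles taken from the TL;DR above
titles <- c("hg38.MA0139.1.RData",
            "mm10.MA0139.1.RData",
            "hg38.SCREEN.GRCh38_CTCF.RData",
            "mm10.SCREEN.mm10_CTCF.RData")
parsed <- parse_ctcf_title(titles)
print(parsed)
```

Splitting at the first dot only keeps database names that themselves contain dots, such as MA0139.1, intact.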
Installation instructions
Install the latest release of R, then get the latest version of Bioconductor by starting R and entering the commands:
# if (!require("BiocManager", quietly = TRUE))
# install.packages("BiocManager")
# BiocManager::install(version = "3.16")
Then, install additional packages using the following code:
# BiocManager::install("AnnotationHub", update = FALSE)
# BiocManager::install("GenomicRanges", update = FALSE)
# BiocManager::install("plyranges", update = FALSE)
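Before running the example, it may help to confirm that the packages listed above are available. The following is a minimal sketch using base R's `requireNamespace()`; the variable names are illustrative, and the install command is left commented out, as in the instructions above.

```r
# Packages named in the installation commands above
pkgs <- c("AnnotationHub", "GenomicRanges", "plyranges")

# requireNamespace() returns TRUE when a package can be loaded,
# without attaching it to the search path
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
  # BiocManager::install(missing_pkgs, update = FALSE)
} else {
  message("All required packages are installed.")
}
```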
Example
suppressMessages(library(AnnotationHub))
ah <- AnnotationHub()
query_data <- subset(ah, preparerclass == "CTCF")
# Explore the AnnotationHub object
query_data
#> AnnotationHub with 51 records
#> # snapshotDate(): 2024-04-30
#> # $dataprovider: JASPAR 2022, CTCFBSDB 2.0, SwissRegulon, Jolma 2013, HOCOMO...
#> # $species: Homo sapiens, Mus musculus
#> # $rdataclass: GRanges
#> # additional mcols(): taxonomyid, genome, description,
#> # coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#> # rdatapath, sourceurl, sourcetype
#> # retrieve records with, e.g., 'object[["AH104716"]]'
#>
#> title
#> AH104716 | T2T.CIS_BP_2.00_Homo_sapiens.RData
#> AH104717 | T2T.CTCFBSDB_PWM.RData
#> AH104718 | T2T.HOCOMOCOv11_core_HUMAN_mono_meme_format.RData
#> AH104719 | T2T.JASPAR2022_CORE_vertebrates_non_redundant_v2.RData
#> AH104720 | T2T.Jolma2013.RData
#> ... ...
#> AH104762 | mm9.JASPAR2022_CORE_vertebrates_non_redundant_v2.RData
#> AH104763 | mm9.Jolma2013.RData
#> AH104764 | mm9.MA0139.1.RData
#> AH104765 | mm9.SwissRegulon_human_and_mouse.RData
#> AH104766 | mm8.CTCFBSDB.CTCF_predicted_mouse.RData
# Get the list of data providers
table(query_data$dataprovider)
#> 
#>           CIS-BP     CTCFBSDB 2.0 ENCODE SCREEN v3     HOCOMOCO v11 
#>                6               12                2                6 
#>      JASPAR 2022       Jolma 2013     SwissRegulon 
#>               13                6                6 
We can find CTCF sites identified using the JASPAR 2022 database in the hg38 human genome:
subset(query_data, species == "Homo sapiens" &
genome == "hg38" &
dataprovider == "JASPAR 2022")
#> AnnotationHub with 2 records
#> # snapshotDate(): 2024-04-30
#> # $dataprovider: JASPAR 2022
#> # $species: Homo sapiens
#> # $rdataclass: GRanges
#> # additional mcols(): taxonomyid, genome, description,
#> # coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#> # rdatapath, sourceurl, sourcetype
#> # retrieve records with, e.g., 'object[["AH104727"]]'
#>
#> title
#> AH104727 | hg38.JASPAR2022_CORE_vertebrates_non_redundant_v2.RData
#> AH104729 | hg38.MA0139.1.RData
# Same for mm10 mouse genome
# subset(query_data, species == "Mus musculus" & genome == "mm10" & dataprovider == "JASPAR 2022")
The hg38.JASPAR2022_CORE_vertebrates_non_redundant_v2 object contains
CTCF sites detected using all three CTCF PWMs.
To retrieve, we’ll use:
# hg38.JASPAR2022_CORE_vertebrates_non_redundant_v2
CTCF_hg38_all <- query_data[["AH104727"]]
#> downloading 1 resources
#> retrieving 1 resource
#> loading from cache
#> require("GenomicRanges")
CTCF_hg38_all
#> GRanges object with 3093041 ranges and 5 metadata columns:
#>             seqnames            ranges strand |                   name
#>                <Rle>         <IRanges>  <Rle> |            <character>
#>         [1]     chr1       11212-11246      + | JASPAR2022_CORE_vert..
#>         [2]     chr1       11399-11432      + | JASPAR2022_CORE_vert..
#>         [3]     chr1       11414-11432      + | JASPAR2022_CORE_vert..
#>         [4]     chr1       12373-12406      + | JASPAR2022_CORE_vert..
#>         [5]     chr1       13507-13541      + | JASPAR2022_CORE_vert..
#>         ...      ...               ...    ... .                    ...
#>   [3093037]     chrY 57215115-57215148      - | JASPAR2022_CORE_vert..
#>   [3093038]     chrY 57215146-57215164      - | JASPAR2022_CORE_vert..
#>   [3093039]     chrY 57215146-57215179      - | JASPAR2022_CORE_vert..
#>   [3093040]     chrY 57215332-57215366      - | JASPAR2022_CORE_vert..
#>   [3093041]     chrY 57216319-57216352      - | JASPAR2022_CORE_vert..
#>                 score    pvalue    qvalue               sequence
#>             <numeric> <numeric> <numeric>            <character>
#>         [1]   7.77064  5.25e-05     0.459 gtgctgtgccagggcgcccc..
#>         [2]  18.48780  2.54e-07     0.118 cagcacgcccacctgctggc..
#>         [3]   9.11475  5.65e-05     0.555    ctggcagctggggacactg
#>         [4]   9.21951  5.25e-05     0.421 CAGCAGGTCTGGCTTTGGCC..
#>         [5]   9.71560  2.24e-05     0.397 GTGCCCTTCCTTTGCTCTGC..
#>         ...       ...       ...       ...                    ...
#>   [3093037]   8.06504  9.11e-05     0.614 CTGCTGGGCCCTCTTGCTCC..
#>   [3093038]   9.11475  5.65e-05     0.726    CTGGCAGCTGGGGACACTG
#>   [3093039]  17.91870  3.72e-07     0.246 CAGCACGCCCGCCTGCTGGC..
#>   [3093040]   7.77064  5.25e-05     0.584 GTGCTGTGCCAGGGCGCCCC..
#>   [3093041]   8.63415  6.96e-05     0.595 CTGCATTTGCGTTCCGACGC..
#>   -------
#>   seqinfo: 24 sequences from hg38 genome
The hg38.MA0139.1 object contains CTCF sites detected using the most
popular MA0139.1 CTCF PWM. To retrieve:
# hg38.MA0139.1
CTCF_hg38 <- query_data[["AH104729"]]
#> downloading 1 resources
#> retrieving 1 resource
#> loading from cache
CTCF_hg38
#> GRanges object with 887980 ranges and 5 metadata columns:
#>            seqnames            ranges strand |        name     score    pvalue
#>               <Rle>         <IRanges>  <Rle> | <character> <numeric> <numeric>
#>        [1]     chr1       11414-11432      + |    MA0139.1   9.11475  5.65e-05
#>        [2]     chr1       14316-14334      + |    MA0139.1   7.83607  9.71e-05
#>        [3]     chr1       15439-15457      + |    MA0139.1   8.00000  9.08e-05
#>        [4]     chr1       16603-16621      + |    MA0139.1   8.04918  8.89e-05
#>        [5]     chr1       16651-16669      + |    MA0139.1  11.42620  1.97e-05
#>        ...      ...               ...    ... .         ...       ...       ...
#>   [887976]     chrY 57209918-57209936      - |    MA0139.1  11.42620  1.97e-05
#>   [887977]     chrY 57209966-57209984      - |    MA0139.1   8.04918  8.89e-05
#>   [887978]     chrY 57211133-57211151      - |    MA0139.1   8.00000  9.08e-05
#>   [887979]     chrY 57212256-57212274      - |    MA0139.1   7.83607  9.71e-05
#>   [887980]     chrY 57215146-57215164      - |    MA0139.1   9.11475  5.65e-05
#>               qvalue            sequence
#>            <numeric>         <character>
#>        [1]     0.555 ctggcagctggggacactg
#>        [2]     0.601 GGACCAACAGGGGCAGGAG
#>        [3]     0.599 TAGCCTCCAGAGGCCTCAG
#>        [4]     0.597 CCACCTGAAGGAGACGCGC
#>        [5]     0.504 TGGCCTACAGGGGCCGCGG
#>        ...       ...                 ...
#>   [887976]     0.648 TGGCCTACAGGGGCCGCGG
#>   [887977]     0.770 CCACCTGAAGGAGACGCGC
#>   [887978]     0.770 TAGCCTCCAGAGGCCTCAG
#>   [887979]     0.770 GGACCAACAGGGGCAGGAG
#>   [887980]     0.726 CTGGCAGCTGGGGACACTG
#>   -------
#>   seqinfo: 24 sequences from hg38 genome
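The metadata columns printed above (score, pvalue) can be used to subset or rank sites. The sketch below rebuilds the first five MA0139.1 rows as a toy GRanges object so that it runs without downloading anything; the object names and the p-value cutoff are assumptions chosen for illustration, not recommendations from the package.

```r
suppressMessages(library(GenomicRanges))

# Toy object rebuilt from the first five rows of the printout above
toy <- GRanges(
  seqnames = rep("chr1", 5),
  ranges = IRanges(start = c(11414, 14316, 15439, 16603, 16651), width = 19),
  strand = rep("+", 5),
  name = rep("MA0139.1", 5),
  score = c(9.11475, 7.83607, 8.00000, 8.04918, 11.42620),
  pvalue = c(5.65e-05, 9.71e-05, 9.08e-05, 8.89e-05, 1.97e-05)
)

# Keep sites below an illustrative p-value cutoff (the value is an assumption)
p_cutoff <- 5e-05
strong <- toy[toy$pvalue < p_cutoff]
strong

# Rank all sites by decreasing motif score
ranked <- toy[order(toy$score, decreasing = TRUE)]
ranked$score
```

The same subsetting expressions should apply unchanged to the full CTCF_hg38 object, since it carries the same metadata columns.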
It is always advisable to sort GRanges objects and keep stan
