CTCF

Genomic coordinates of FIMO-predicted CTCF binding sites, obtained using JASPAR and other PWMs, for human and mouse genome assemblies including mm39 and T2T. Also included are experimentally derived ENCODE SCREEN CTCF-bound CREs.

<!-- README.md is generated from README.Rmd. Please edit that file -->

Lifecycle: stable

<!-- [![BioC status](http://www.bioconductor.org/shields/build/release/bioc/CTCF.svg)](https://bioconductor.org/checkResults/release/bioc-LATEST/CTCF) [![R-CMD-check-bioc](https://github.com/mdozmorov/CTCF/actions/workflows/R-CMD-check-bioc.yaml/badge.svg)](https://github.com/mdozmorov/CTCF/actions/workflows/R-CMD-check-bioc.yaml) -->

CTCF defines an AnnotationHub resource representing genomic coordinates of FIMO-predicted CTCF binding sites for human and mouse genomes, including the Telomere-to-Telomere and mm39 genome assemblies. It also includes experimentally defined CTCF-bound cis-regulatory elements from ENCODE SCREEN.

TL;DR: for the human hg38 genome assembly, use hg38.MA0139.1.RData (“AH104729”). For the mouse mm10 genome assembly, use mm10.MA0139.1.RData (“AH104755”). For ENCODE SCREEN data, use hg38.SCREEN.GRCh38_CTCF.RData (“AH104730”) or mm10.SCREEN.mm10_CTCF.RData (“AH104756”).

The CTCF GRanges are named as <assembly>.<Database>. The FIMO-predicted data includes extra columns with motif name, score, p-value, q-value, and the motif sequence.
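As a rough illustration of the `<assembly>.<Database>` naming scheme, the sketch below splits a few record titles that appear in this README into their assembly and database parts. The helper `parse_ctcf_title()` is hypothetical and not part of the CTCF package; it assumes every title ends in `.RData` and that the assembly is everything before the first dot.

```r
# Record titles copied from this README
titles <- c("hg38.MA0139.1.RData",
            "T2T.CTCFBSDB_PWM.RData",
            "mm10.SCREEN.mm10_CTCF.RData")

# Hypothetical helper (an assumption for illustration, not a package function):
# drop the ".RData" suffix, then split on the first dot only
parse_ctcf_title <- function(title) {
  stem <- sub("\\.RData$", "", title)
  data.frame(
    title    = title,
    assembly = sub("\\..*$", "", stem),     # text before the first dot
    database = sub("^[^.]*\\.", "", stem),  # text after the first dot
    stringsAsFactors = FALSE
  )
}

parsed <- parse_ctcf_title(titles)
parsed
```

Splitting on the first dot only matters because database names such as MA0139.1 contain dots themselves.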

<!-- **Please, note that the updated CTCF objects will be available in Bioconductor/AnnotationHub 3.16.** To test the following code, use the `bioconductor::devel` Docker image. Run: &#10;``` bash docker run -e PASSWORD=password -p 8787:8787 -d --rm -v $(pwd):/home/rstudio bioconductor/bioconductor_docker:devel ``` Open http://localhost:8787 and login using `rstudio/password` credentials. -->

Installation instructions

Install the latest release of R, then get the latest version of Bioconductor by starting R and entering the commands:

# if (!require("BiocManager", quietly = TRUE))
#     install.packages("BiocManager")
# BiocManager::install(version = "3.16")

Then, install additional packages using the following code:

# BiocManager::install("AnnotationHub", update = FALSE) 
# BiocManager::install("GenomicRanges", update = FALSE)
# BiocManager::install("plyranges", update = FALSE)
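Before running the examples, it may help to confirm which of these packages are already available. The following is a minimal sketch using base R only; the variable names are illustrative.

```r
# Packages mentioned in the installation instructions above
pkgs <- c("AnnotationHub", "GenomicRanges", "plyranges")

# requireNamespace() returns TRUE/FALSE without attaching the package
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
installed

# Names of packages that still need BiocManager::install()
missing_pkgs <- names(installed)[!installed]
missing_pkgs
```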

Example

suppressMessages(library(AnnotationHub))
ah <- AnnotationHub()
query_data <- subset(ah, preparerclass == "CTCF")
# Explore the AnnotationHub object
query_data
#> AnnotationHub with 51 records
#> # snapshotDate(): 2024-04-30
#> # $dataprovider: JASPAR 2022, CTCFBSDB 2.0, SwissRegulon, Jolma 2013, HOCOMO...
#> # $species: Homo sapiens, Mus musculus
#> # $rdataclass: GRanges
#> # additional mcols(): taxonomyid, genome, description,
#> #   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#> #   rdatapath, sourceurl, sourcetype 
#> # retrieve records with, e.g., 'object[["AH104716"]]' 
#> 
#>              title                                                 
#>   AH104716 | T2T.CIS_BP_2.00_Homo_sapiens.RData                    
#>   AH104717 | T2T.CTCFBSDB_PWM.RData                                
#>   AH104718 | T2T.HOCOMOCOv11_core_HUMAN_mono_meme_format.RData     
#>   AH104719 | T2T.JASPAR2022_CORE_vertebrates_non_redundant_v2.RData
#>   AH104720 | T2T.Jolma2013.RData                                   
#>   ...        ...                                                   
#>   AH104762 | mm9.JASPAR2022_CORE_vertebrates_non_redundant_v2.RData
#>   AH104763 | mm9.Jolma2013.RData                                   
#>   AH104764 | mm9.MA0139.1.RData                                    
#>   AH104765 | mm9.SwissRegulon_human_and_mouse.RData                
#>   AH104766 | mm8.CTCFBSDB.CTCF_predicted_mouse.RData
# Get the list of data providers
query_data$dataprovider %>% table()
#> .
#>           CIS-BP     CTCFBSDB 2.0 ENCODE SCREEN v3     HOCOMOCO v11 
#>                6               12                2                6 
#>      JASPAR 2022       Jolma 2013     SwissRegulon 
#>               13                6                6

We can find CTCF sites identified using the JASPAR 2022 database in the hg38 human genome:

subset(query_data, species == "Homo sapiens" & 
                   genome == "hg38" & 
                   dataprovider == "JASPAR 2022")
#> AnnotationHub with 2 records
#> # snapshotDate(): 2024-04-30
#> # $dataprovider: JASPAR 2022
#> # $species: Homo sapiens
#> # $rdataclass: GRanges
#> # additional mcols(): taxonomyid, genome, description,
#> #   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#> #   rdatapath, sourceurl, sourcetype 
#> # retrieve records with, e.g., 'object[["AH104727"]]' 
#> 
#>              title                                                  
#>   AH104727 | hg38.JASPAR2022_CORE_vertebrates_non_redundant_v2.RData
#>   AH104729 | hg38.MA0139.1.RData
# Same for mm10 mouse genome
# subset(query_data, species == "Mus musculus" & genome == "mm10" & dataprovider == "JASPAR 2022")

The hg38.JASPAR2022_CORE_vertebrates_non_redundant_v2 object contains CTCF sites detected using all three CTCF PWMs. To retrieve it, we’ll use:

# hg38.JASPAR2022_CORE_vertebrates_non_redundant_v2
CTCF_hg38_all <- query_data[["AH104727"]]
#> downloading 1 resources
#> retrieving 1 resource
#> loading from cache
#> require("GenomicRanges")
#> Warning: package 'S4Vectors' was built under R version 4.4.1
#> Warning: package 'IRanges' was built under R version 4.4.1
CTCF_hg38_all
#> GRanges object with 3093041 ranges and 5 metadata columns:
#>             seqnames            ranges strand |                   name
#>                <Rle>         <IRanges>  <Rle> |            <character>
#>         [1]     chr1       11212-11246      + | JASPAR2022_CORE_vert..
#>         [2]     chr1       11399-11432      + | JASPAR2022_CORE_vert..
#>         [3]     chr1       11414-11432      + | JASPAR2022_CORE_vert..
#>         [4]     chr1       12373-12406      + | JASPAR2022_CORE_vert..
#>         [5]     chr1       13507-13541      + | JASPAR2022_CORE_vert..
#>         ...      ...               ...    ... .                    ...
#>   [3093037]     chrY 57215115-57215148      - | JASPAR2022_CORE_vert..
#>   [3093038]     chrY 57215146-57215164      - | JASPAR2022_CORE_vert..
#>   [3093039]     chrY 57215146-57215179      - | JASPAR2022_CORE_vert..
#>   [3093040]     chrY 57215332-57215366      - | JASPAR2022_CORE_vert..
#>   [3093041]     chrY 57216319-57216352      - | JASPAR2022_CORE_vert..
#>                 score    pvalue    qvalue               sequence
#>             <numeric> <numeric> <numeric>            <character>
#>         [1]   7.77064  5.25e-05     0.459 gtgctgtgccagggcgcccc..
#>         [2]  18.48780  2.54e-07     0.118 cagcacgcccacctgctggc..
#>         [3]   9.11475  5.65e-05     0.555    ctggcagctggggacactg
#>         [4]   9.21951  5.25e-05     0.421 CAGCAGGTCTGGCTTTGGCC..
#>         [5]   9.71560  2.24e-05     0.397 GTGCCCTTCCTTTGCTCTGC..
#>         ...       ...       ...       ...                    ...
#>   [3093037]   8.06504  9.11e-05     0.614 CTGCTGGGCCCTCTTGCTCC..
#>   [3093038]   9.11475  5.65e-05     0.726    CTGGCAGCTGGGGACACTG
#>   [3093039]  17.91870  3.72e-07     0.246 CAGCACGCCCGCCTGCTGGC..
#>   [3093040]   7.77064  5.25e-05     0.584 GTGCTGTGCCAGGGCGCCCC..
#>   [3093041]   8.63415  6.96e-05     0.595 CTGCATTTGCGTTCCGACGC..
#>   -------
#>   seqinfo: 24 sequences from hg38 genome

The hg38.MA0139.1 object contains CTCF sites detected using the most popular MA0139.1 CTCF PWM. To retrieve:

# hg38.MA0139.1
CTCF_hg38 <- query_data[["AH104729"]]
#> downloading 1 resources
#> retrieving 1 resource
#> loading from cache
CTCF_hg38
#> GRanges object with 887980 ranges and 5 metadata columns:
#>            seqnames            ranges strand |        name     score    pvalue
#>               <Rle>         <IRanges>  <Rle> | <character> <numeric> <numeric>
#>        [1]     chr1       11414-11432      + |    MA0139.1   9.11475  5.65e-05
#>        [2]     chr1       14316-14334      + |    MA0139.1   7.83607  9.71e-05
#>        [3]     chr1       15439-15457      + |    MA0139.1   8.00000  9.08e-05
#>        [4]     chr1       16603-16621      + |    MA0139.1   8.04918  8.89e-05
#>        [5]     chr1       16651-16669      + |    MA0139.1  11.42620  1.97e-05
#>        ...      ...               ...    ... .         ...       ...       ...
#>   [887976]     chrY 57209918-57209936      - |    MA0139.1  11.42620  1.97e-05
#>   [887977]     chrY 57209966-57209984      - |    MA0139.1   8.04918  8.89e-05
#>   [887978]     chrY 57211133-57211151      - |    MA0139.1   8.00000  9.08e-05
#>   [887979]     chrY 57212256-57212274      - |    MA0139.1   7.83607  9.71e-05
#>   [887980]     chrY 57215146-57215164      - |    MA0139.1   9.11475  5.65e-05
#>               qvalue            sequence
#>            <numeric>         <character>
#>        [1]     0.555 ctggcagctggggacactg
#>        [2]     0.601 GGACCAACAGGGGCAGGAG
#>        [3]     0.599 TAGCCTCCAGAGGCCTCAG
#>        [4]     0.597 CCACCTGAAGGAGACGCGC
#>        [5]     0.504 TGGCCTACAGGGGCCGCGG
#>        ...       ...                 ...
#>   [887976]     0.648 TGGCCTACAGGGGCCGCGG
#>   [887977]     0.770 CCACCTGAAGGAGACGCGC
#>   [887978]     0.770 TAGCCTCCAGAGGCCTCAG
#>   [887979]     0.770 GGACCAACAGGGGCAGGAG
#>   [887980]     0.726 CTGGCAGCTGGGGACACTG
#>   -------
#>   seqinfo: 24 sequences from hg38 genome
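The metadata columns printed above (name, score, pvalue, qvalue, sequence) can be used to subset the sites. The sketch below rebuilds the first five hg38.MA0139.1 records as a toy GRanges so that it runs without downloading anything, then keeps sites below a p-value cutoff. The cutoff value is an assumption chosen only for illustration, not a recommendation from the package.

```r
suppressMessages(library(GenomicRanges))

# Toy GRanges copied from the first five hg38.MA0139.1 records shown above
toy_ctcf <- GRanges(
  seqnames = rep("chr1", 5),
  ranges   = IRanges(start = c(11414, 14316, 15439, 16603, 16651),
                     end   = c(11432, 14334, 15457, 16621, 16669)),
  strand   = rep("+", 5),
  name     = rep("MA0139.1", 5),
  score    = c(9.11475, 7.83607, 8.00000, 8.04918, 11.42620),
  pvalue   = c(5.65e-05, 9.71e-05, 9.08e-05, 8.89e-05, 1.97e-05)
)

# Illustrative cutoff (an assumption); adjust to the analysis at hand
pvalue_cutoff <- 5e-05

# Logical subsetting on a metadata column keeps only the matching ranges
toy_strong <- toy_ctcf[toy_ctcf$pvalue < pvalue_cutoff]
toy_strong

# Every MA0139.1 match in this toy set spans the same number of bases
width(toy_ctcf)
```

The same subsetting expression should apply to the full CTCF_hg38 object, since it carries the same metadata columns.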

It is always advisable to sort GRanges objects and keep stan
