[![Contributors][contributors-shield]][contributors-url] [![Forks][forks-shield]][forks-url] [![Stargazers][stars-shield]][stars-url] [![Issues][issues-shield]][issues-url]

<br /> <div align="center"> <a href="https://www.biorxiv.org/content/10.1101/2023.12.30.573697v1"> <img src="images/logo.png" alt="Logo" width="650" height="300"> </a> <h3 align="center">RepeatOBserver</h3> <p align="center"> An R package to visualize chromosome scale repeat patterns and predict centromere locations. <br /> <a href="https://github.com/celphin/RepeatOBserverV1/issues">Report Bug</a> </p> </div>  <details> <summary>Table of Contents</summary> <ol> <li> <a href="#getting-started">Getting Started</a> <ul> <li><a href="#software-needed">Software Needed</a></li> <li><a href="#r-package-installation">R Package Installation</a></li> <li><a href="#version-changes">Version changes</a></li> <li><a href="#basic-run">Basic run</a></li> <li><a href="#output">Output</a></li> <li><a href="#finding-repeat-sequences">Finding repeat sequences</a></li> </ul> </li> <li><a href="#citation">Citation</a></li> <li><a href="#contact-and-questions">Contact and Questions</a></li> <li><a href="#usage-examples">Usage Examples</a></li> <li><a href="#troubleshooting">Troubleshooting</a></li> </ol> </details>

Getting Started

RepeatOBserver is an R package that can be run on any chromosome-scale reference genome assembly (e.g. a FASTA file). It returns many plots describing the tandem repeats and clusters of transposons found across each chromosome. It also returns a predicted centromere location for each chromosome, based on the repeat diversity across that chromosome.

You can learn more about the interpretations of the plots in our manuscript here: https://doi.org/10.1111/1755-0998.14084

Software needed

The following software is needed to run the automatic RepeatOBserver script:

seqkit/2.3.1 : https://bioinf.shenwei.me/seqkit/
r/4.1.2 : https://cran.r-project.org/bin/windows/base/old/
(optional to see isochores) emboss/6.6.0 : https://emboss.sourceforge.net/download/

Newer versions of this software may work, but the program has not yet been tested thoroughly with them. If you are unable to install any of the programs above, you can still run the RepeatOBserver code in R, but the automated bash script will not work for you (see <a href="#troubleshooting">Troubleshooting</a> at the end of this page for details on how to run the code without this script).

Example software installation (using Compute Canada modules):

module load seqkit/2.3.1
module load StdEnv/2020 
module load emboss/6.6.0
module load r/4.1.2
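
Before launching the script, it can help to confirm that the required tools are actually on your PATH. The sketch below is not part of RepeatOBserver: `missing_tools` is a hypothetical helper, and it assumes the executables are named `seqkit` and `Rscript` (EMBOSS is optional, so it is left out).

```shell
# Hypothetical helper (not part of RepeatOBserver): print each command
# name from the argument list that cannot be found on the PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
        fi
    done
}

# Assumed executable names for the required software.
missing=$(missing_tools seqkit Rscript)
if [ -n "$missing" ]; then
    echo "Not found on PATH:" >&2
    echo "$missing" >&2
fi
```

If nothing is printed, both commands were found; this says nothing about whether the installed versions match the ones listed above.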

R Package Installation

To install the R package "RepeatOBserverV1", you will first need to install the package devtools in your version of R.

 install.packages("devtools")

 library(devtools)

 install_github("celphin/RepeatOBserverV1") #to install the package
  # Select 1:All to install all the required packages

 library(RepeatOBserverV1) # to load the package

Version changes

Oct 28th 2025:

Setup_Run_Repeats.sh has been updated, which should fix the renaming of the chromosomes so that they are in a more logical order. It is recommended to use this new script when using the program for the first time. You can also find the old script here if you want to stay with the previous naming system: Setup_Run_Repeats_old_Oct28_2025.sh
There is also a small update to the R package (RepeatOBserverV1) so that it now finishes plotting the summary plots for all chromosomes. This should not affect most genomes but solves a rare error that occurs in some genomes.

Feb 20th, 2024: The older Setup_Run_Repeats_old_Feb20_2024.sh may still work, but it is best to rerun the program from scratch (i.e. download the new script, reinstall the R library and delete old folders). Changes include:

Ability to run longer chromosomes in 400 Mbp parts
An easier to use CGwalk and transform option
A change in the output file structure, with matching output documentation below
A list of the original chromosome names and what they have been renamed to in the program
Summary plots showing all the chromosomes in one plot

Note that you should not get any different results but restarting the program with the old script may no longer work.

Example new plot:

![Example plot showing all chromosomes in Arabidopsis][product-all_chromosomes]

Basic run

Download a copy of the Setup_Run_Repeats.sh script from this GitHub repo into the directory that you want to run the code in.

wget https://raw.githubusercontent.com/celphin/RepeatOBserverV1/main/Setup_Run_Repeats.sh

Make sure the script is executable and set up to run on Unix.

chmod +x Setup_Run_Repeats.sh
dos2unix Setup_Run_Repeats.sh

Move your chromosome-scale FASTA file (it needs to contain more than one chromosome, i.e. a genome) into the directory that you want to run RepeatOBserverV1 in. Make sure to unzip/gunzip the file. In this directory, with your desired reference genome, you can run the default RepeatOBserverV1 commands automatically with the following command:

sh Setup_Run_Repeats.sh -i SpeciesName -f Reference_Genome.fasta -h H0 -c c -m m -g FALSE
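
As a quick sanity check before running the command above, you can confirm that the FASTA file holds more than one sequence. This is a minimal sketch rather than anything the script itself does: `count_fasta_records` is a hypothetical helper that counts header lines beginning with `>`, and `Reference_Genome.fasta` is the placeholder file name used above.

```shell
# Hypothetical helper: count FASTA records by counting header lines.
count_fasta_records() {
    # grep -c prints the number of matching lines (0 if none).
    grep -c '^>' "$1"
}

genome="Reference_Genome.fasta"   # placeholder name from the command above
if [ -f "$genome" ]; then
    n=$(count_fasta_records "$genome")
    if [ "$n" -lt 2 ]; then
        echo "Only $n sequence(s) found in $genome" >&2
    fi
fi
```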

Necessary parameters:

| Parameter | Usage | Example Input |
|-----------|-------|---------------|
| -i | Species Name | Fagopyrum (cannot contain an _ or space) |
| -f | Reference genome fasta file | Fagopyrum_Main.fasta |
| -h | Haplotype (string) | H0 (cannot contain an _ or space) |
| -c | cpus available (any integer value) | 20 |
| -m | memory available (MB) | 128000 |
| -g | FALSE to run for AT DNAwalk or TRUE to run for CG DNAwalk | FALSE |
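
Since the `-i` and `-h` values cannot contain an underscore or a space, a small pre-flight check can catch a bad name before a long run starts. The following is only a sketch: `valid_label` is a hypothetical helper, and rejecting an empty value is an added assumption, not a documented rule.

```shell
# Hypothetical helper: succeed only if the value is non-empty
# (assumption) and contains neither an underscore nor a space.
valid_label() {
    case "$1" in
        ""|*_*|*" "*) return 1 ;;
        *) return 0 ;;
    esac
}

species="Fagopyrum"   # example values from the parameter table
haplotype="H0"
for value in "$species" "$haplotype"; do
    valid_label "$value" || echo "Invalid -i/-h value: '$value'" >&2
done
```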

If you require an allocation to get enough memory or CPUs on your server (125G for 15 CPUs is best), here is a Slurm template to follow:

cat << EOF > SPP_repeats.sh
#!/bin/bash
#SBATCH --account=<your-account>
#SBATCH --time=5:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=15
#SBATCH --mem=128000M

module load StdEnv/2020
module load seqkit/2.3.1
module load emboss/6.6.0
module load r/4.3.1

srun Setup_Run_Repeats.sh -i SpeciesName -f Reference_Genome.fasta -h H0 -c c -m m -g FALSE

EOF

sbatch SPP_repeats.sh

Some example/test code can be found <a href="https://github.com/celphin/RepeatOBserverV1/blob/main/Example_code/Example_test_code.R">here.</a> After running the Wine genome you should get the following plot for chromosome 7 repeat lengths 15 to 35 bp. ![Example Results Wine][product-example]

Output

Summary plots and output files can be found in:

cd <your-starting-directory>/output_chromosomes/Species_Haplotype/Summary_output

Note: chromosomes are named differently than in the original fasta file and you can find the new names in chromosome_renaming.txt

Missing data

You can use the 'tree' command in the folder above to see all subfolders and files described below. Folders that are missing from the list below did not finish. The program removes any scaffolds or chromosomes shorter than 5 Mbp. If the program did not finish because it ran out of time, you can restart the script with the exact same submission as before and it will start where it left off.
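
To see in advance which sequences fall under the 5 Mbp cutoff, you can total the sequence length of each FASTA record. The following is a rough sketch using `awk`, independent of how the program applies its own filter: `short_sequences` is a hypothetical helper, and treating the cutoff as 5000000 bp is an assumption.

```shell
# Hypothetical helper: print "name<TAB>length" for every FASTA record
# shorter than the given minimum length.
short_sequences() {
    # $1 = FASTA file, $2 = minimum length in bp
    awk -v min="$2" '
        /^>/ {
            # A new header: report the previous record if it was too short.
            if (name != "" && len < min) print name "\t" len
            name = substr($1, 2); len = 0; next
        }
        { len += length($0) }
        END { if (name != "" && len < min) print name "\t" len }
    ' "$1"
}

genome="Reference_Genome.fasta"   # placeholder file name
if [ -f "$genome" ]; then
    short_sequences "$genome" 5000000
fi
```

The names printed are the original FASTA headers (up to the first space), not the renamed chromosomes listed in chromosome_renaming.txt.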

Output folders and summary files that you should find in the directory above, if the whole program worked:

| Main folders | Description (more details below) |
|--------------|----------------------------------|
| DNAwalks | 1D and 2D DNAwalks |
| histograms | histogram centromere predictions and plots |
| output_data | Raw data files including Shannon diversity, DNAwalks, Fourier transforms |
| Shannon_div | Shannon diversity plots for each chromosome |
| spectra | Heat maps of the Fourier transform output |
| isochores | CG isochores plot made with the EMBOSS program, useful to see if centromere positions are associated with isochores |

| Summary files | Description |
|---------------|-------------|
| chromosome_renaming.txt | New chromosome names assigned to each chromosome in the program |
| Species_Haplotype_Histograms.png | All chromosomes histograms plotted in one figure |
| Species_Haplotype_Shannon_div.png | All chromosomes Shannon_div plotted in one figure |
| Species_Haplotype_rolling_mean_500Kbp_Shannon_div.png | All chromosomes Shannon_div in 500kbp rolling windows plotted in one figure |
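
One way to check a finished run against the main folders listed above is to loop over the expected names and report any that are absent. This is a sketch only: `missing_output_folders` is a hypothetical helper, and because EMBOSS is optional, a missing `isochores` folder does not necessarily mean the run failed.

```shell
# Hypothetical helper: list which of the main output folders are
# absent from the given Summary_output directory.
missing_output_folders() {
    local dir="$1" folder
    for folder in DNAwalks histograms output_data Shannon_div spectra isochores; do
        [ -d "$dir/$folder" ] || echo "$folder"
    done
}
```

For example, `missing_output_folders output_chromosomes/Species_Haplotype/Summary_output` prints one folder name per line, or nothing if all six are present.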

Subfolders described:

DNAwalks contains:

| Folder/file name | Description | Example file |
|------------------|-------------|--------------|
| 1D | 1D CG and AT DNAwalks, rainbow colours change every 10Kbp | Species_Haplotype_Chr1_DNAwalk1D_AT_total.png |
| 2D | 2D DNAwalks, the 1D walks plotted against each other | Species_Haplotype_Chr1_DNAwalk2D_total.png |

histograms contains:

| Folder/file name | Description |
|------------------|-------------|
| Centromere_histograms_summary.txt | The predicted centromere positions for every chromosome based on the histogram output |
| Species_Haplotype_Chr1_histogram_....png | Histogram plots showing counts of where in the genome each repeat length minimized |

output_data contains:

|Folder/file name | Description | |-----------------
