ggchord: Multi-Sequence BLAST Alignment Chord Diagram Visualization Tool

Overview

ggchord is an R function based on ggplot2 that visualizes BLAST alignment results for multiple sequences as chord diagrams. It supports extensive style customization, making it easy to display homologous regions and structural relationships between sequences. Version 0.1.0 is a major upgrade from simple multi-sequence chord diagrams to feature-rich ones that show alignment relationships among several sequences at once:

  • Each sequence is presented as an arc or custom track, with length proportionally mapped.
  • Colored ribbons represent alignment regions between sequences, supporting coloring by similarity or source.
  • Equipped with customizable axes for precise annotation of sequence positions and lengths.
  • Supports layout optimizations such as global rotation and sequence orientation adjustment to adapt to different analysis scenarios.
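
The proportional mapping of sequence length to arc length can be illustrated with a small base R sketch. This is not ggchord's internal code: the helper name `arc_spans` and the meaning of `gap` (the fraction of the circle reserved after each sequence) are assumptions made for illustration only.

```r
# Hypothetical helper (not part of ggchord): split a full circle into arcs
# whose angular widths are proportional to sequence length, separated by gaps.
arc_spans <- function(lengths, gap = 0.05) {
  n <- length(lengths)
  gap_angle <- 2 * pi * gap                   # angle reserved for one gap
  usable <- 2 * pi - n * gap_angle            # angle left for the arcs themselves
  width <- usable * lengths / sum(lengths)    # arc width proportional to length
  start <- cumsum(c(0, head(width + gap_angle, -1)))
  data.frame(start = start, end = start + width)
}

# Lengths taken from the example seq_track.tsv in this README
seq_lengths <- c(64323, 32090, 57367, 83080)
spans <- arc_spans(seq_lengths)
spans$end - spans$start   # arc widths, in radians
```

Because every arc is scaled by the same factor, the ratio between any two arc widths equals the ratio between the corresponding sequence lengths.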

It is suitable for research in comparative genomics, pan-genome analysis, phage-host sequence relationship studies, etc., helping researchers quickly identify homologous patterns between sequences.

Key Features

  • Multi-sequence Support: Simultaneously display alignment relationships of 2 or more sequences, no longer limited to pairwise comparisons.
  • Sequence-level Customization:
    • Customize sequence order, orientation (forward/reverse), gaps, and radii.
    • Automatically or manually specify sequence colors and labels to improve readability.
  • Refined Axes:
    • Each sequence has independent axes with major/minor ticks, clearly labeling length positions.
    • Adjust tick lengths, label sizes, and offsets to balance aesthetics and information density.
  • Flexible Ribbon Styles:
    • 3 coloring schemes (single color, by query sequence, gradient by similarity).
    • Adjustable gap between ribbons and sequences; supports customization of Bézier curve control points for smoothness.
  • Layout Optimization: The entire graph can be rotated to meet different display needs.
  • Debug Mode: Assists in troubleshooting data issues by displaying counts of valid/invalid alignments.
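
To give a feel for how Bézier control points shape a ribbon edge, here is a generic sketch of a quadratic Bézier curve in base R. Whether ggchord uses quadratic or cubic curves internally is not stated in this README, so the function `bezier_quadratic` and the choice of the circle centre as control point are assumptions for illustration.

```r
# Quadratic Bezier curve: B(u) = (1-u)^2 * P0 + 2(1-u)u * P1 + u^2 * P2.
# Generic illustration only; not ggchord's actual ribbon code.
bezier_quadratic <- function(p0, p1, p2, n = 50) {
  u <- seq(0, 1, length.out = n)
  data.frame(
    x = (1 - u)^2 * p0[1] + 2 * (1 - u) * u * p1[1] + u^2 * p2[1],
    y = (1 - u)^2 * p0[2] + 2 * (1 - u) * u * p1[2] + u^2 * p2[2]
  )
}

# Two points on the unit circle joined through a control point at the centre
p_start <- c(cos(0), sin(0))
p_end <- c(cos(2 * pi / 3), sin(2 * pi / 3))
curve_xy <- bezier_quadratic(p_start, c(0, 0), p_end)
```

Moving the control point away from the centre pulls the curve toward it, which is the kind of smoothness adjustment the feature list refers to.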

Installation

Dependencies

  • R (≥ 3.6.0)
  • ggplot2 (≥ 3.3.0)
  • ggnewscale (≥ 0.5.0)
  • RColorBrewer

Install the dependencies from CRAN:

install.packages(c("ggplot2", "ggnewscale", "RColorBrewer"))

How to install ggchord?

Install the stable version of ggchord from CRAN:

install.packages("ggchord")

If you want the development version, install it from GitHub:

devtools::install_github("DangJem/ggchord")

Alternatively, install it from a downloaded source tarball:

install.packages("ggchord_0.2.0.tar.gz", repos = NULL, type = "source")

Usage Instructions

Preliminary Data Preparation

ggchord accepts three types of input data; only the first is required:

【Required】Sequence Information Data (seq_data)

A TSV (tab-separated values) file containing basic sequence information. It must include the following columns:

  • seq_id: Unique sequence identifier (e.g., gene name, accession number)
  • length: Sequence length (positive number)

Example:

seq_data <- read.delim("seq_track.tsv", sep = "\t", stringsAsFactors = FALSE)

The format of seq_track.tsv is as follows (example):

seq_id	length
MT108731.1	64323
MT118296.1	32090
OQ646790.1	57367
OR222515.1	83080
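
Before plotting, it can help to confirm that seq_data meets the requirements listed above. The following is a minimal sketch in base R; `check_seq_data` is a hypothetical helper, not a ggchord function, and the uniqueness and positivity checks simply restate the column descriptions.

```r
# Hypothetical helper (not part of ggchord): verify the documented seq_data columns.
check_seq_data <- function(seq_data) {
  if (!all(c("seq_id", "length") %in% names(seq_data))) {
    stop("seq_data must contain the columns 'seq_id' and 'length'")
  }
  if (anyDuplicated(seq_data$seq_id) > 0) {
    stop("seq_id values must be unique")
  }
  if (!is.numeric(seq_data$length) || anyNA(seq_data$length) ||
      any(seq_data$length <= 0)) {
    stop("length values must be positive numbers")
  }
  invisible(TRUE)
}

# Same content as the example seq_track.tsv
seq_data <- data.frame(
  seq_id = c("MT108731.1", "MT118296.1", "OQ646790.1", "OR222515.1"),
  length = c(64323, 32090, 57367, 83080),
  stringsAsFactors = FALSE
)
check_seq_data(seq_data)
```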

You can automatically generate this table from FASTA files using the following command:

seqkit fx2tab -nil *fna | sed '1i seq_id\tlength' > seq_track.tsv

【Optional】Alignment Data (ribbon_data)

A TSV (tab-separated values) file containing BLAST alignment results (convertible from outfmt 6 or outfmt 7 output). It must include the following columns:

  • qaccver: Query sequence ID (must exist in seq_data$seq_id)
  • saccver: Subject sequence ID (must exist in seq_data$seq_id)
  • length: Alignment length
  • pident: Percent identity of the alignment
  • qstart/qend: Start/end positions of the alignment on the query sequence
  • sstart/send: Start/end positions of the alignment on the subject sequence
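
The ID requirement above can be checked before plotting. The sketch below uses base R and toy data; `flag_invalid_ribbons` is a hypothetical helper, and the extra rule that alignment coordinates must not exceed the sequence length is an assumption, not ggchord's documented validity rule.

```r
# Hypothetical helper (not part of ggchord): return TRUE for alignments whose
# IDs are missing from seq_data or whose coordinates exceed the sequence length.
flag_invalid_ribbons <- function(ribbon_data, seq_data) {
  len <- setNames(seq_data$length, seq_data$seq_id)
  q_id <- as.character(ribbon_data$qaccver)
  s_id <- as.character(ribbon_data$saccver)
  known <- q_id %in% seq_data$seq_id & s_id %in% seq_data$seq_id
  # Looking up an unknown ID yields NA, which is treated as invalid below
  in_range <- pmax(ribbon_data$qstart, ribbon_data$qend) <= len[q_id] &
    pmax(ribbon_data$sstart, ribbon_data$send) <= len[s_id]
  unname(!(known & !is.na(in_range) & in_range))
}

# Toy data (made up for illustration)
toy_seq <- data.frame(seq_id = c("A", "B"), length = c(1000, 500),
                      stringsAsFactors = FALSE)
toy_ribbons <- data.frame(
  qaccver = c("A", "A", "C"),
  saccver = c("B", "B", "B"),
  qstart = c(1, 900, 1), qend = c(200, 1100, 50),
  sstart = c(300, 1, 1), send = c(100, 200, 50),
  stringsAsFactors = FALSE
)
invalid <- flag_invalid_ribbons(toy_ribbons, toy_seq)
table(invalid)
```

Rows flagged as invalid can then be dropped with `toy_ribbons[!invalid, ]` or inspected individually.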

You can use the following script to perform BLAST alignments on example sequences and obtain results in outfmt7 format:

# Script to run BLAST alignments using example FASTA files
seqs=("MT108731.1" "MT118296.1" "OQ646790.1" "OR222515.1")
seqsNum=${#seqs[@]}
ext="fna"
for ((i=0; i<seqsNum-1; i++)); do
  for ((j=i+1; j<seqsNum; j++)); do
    echo -e "Running BLASTN: ${seqs[$i]} vs ${seqs[$j]}"
    blastn \
      -outfmt '7 qaccver saccver pident length mismatch gapopen qstart qend sstart send evalue bitscore qcovs qlen slen sstrand stitle' \
      -query "${seqs[$i]}.${ext}" \
      -subject "${seqs[$j]}.${ext}" \
      -out "${seqs[$i]}__${seqs[$j]}.o7"
  done
done

【Optional】Gene Data (gene_data)

A TSV (tab-separated values) file that must include the following columns:

  • seq_id: Sequence identifier (e.g., accession number); must match a seq_id in the sequence information data (seq_data)
  • start: Gene start position
  • end: Gene end position
  • strand: Strand direction (usually + for forward, - for reverse)
  • anno: Gene annotation, such as functional description of the gene

The format of the example file gene_track.tsv is as follows:

seq_id	start	end	strand	anno
MT108731.1	100	200	+	DNA binding protein
MT118296.1	300	400	-	Transcription factor

You can convert GFF3 format files into a gene data table using the gff2gene_track.R script. The script content is as follows:

library(tidyverse)

# Get paths of all GFF3 files in the current directory (pattern is a regular expression)
gff3FilesPath <- list.files(path = ".", pattern = "\\.gff3$")

# Read all GFF3 files and merge them into one data frame
gff3Cols <- c("seq_id", "source", "type", "start", "end", "score", "strand", "phase", "attributes")
gff3Table <- map_df(
  gff3FilesPath,
  ~ read_tsv(.x, show_col_types = FALSE, comment = "#", col_names = gff3Cols)
)

# Keep CDS records and extract the product annotation
# (the pattern also matches when product is the last attribute, with no trailing ";")
geneTrackTable <- gff3Table %>%
  filter(type == "CDS") %>%
  mutate(anno = str_extract(attributes, "(?<=product=)[^;]+")) %>%
  select(seq_id, start, end, strand, anno)

# Save the processed data frame as a TSV file
write_tsv(geneTrackTable, "gene_track.tsv")

After running the above script, a gene_track.tsv file will be generated in the current directory, which can be used as gene data for subsequent analysis and visualization.
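
In GFF3 files the product attribute may be the last field in the attributes column, with no trailing semicolon, or may be absent. The following base R sketch shows one way to extract it that handles both cases; the attribute strings are toy examples and `extract_product` is a hypothetical helper.

```r
# Toy GFF3 attribute strings (made up for illustration)
gff_attributes <- c(
  "ID=cds-1;Name=abc;product=DNA binding protein;locus_tag=X_1",
  "ID=cds-2;product=Transcription factor",
  "ID=cds-3;Name=no_product"
)

# Capture everything after "product=" up to the next ";" or the end of the string;
# return NA when the attribute is absent.
extract_product <- function(x) {
  hits <- regmatches(x, regexec("product=([^;]+)", x))
  vapply(hits, function(h) if (length(h) == 2) h[2] else NA_character_,
         character(1))
}

products <- extract_product(gff_attributes)
products
```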

Usage Examples

Data Reading

# Read sequence length data
seq_data <- read.delim("seq_track.tsv", sep = "\t", stringsAsFactors = FALSE)

# Read and process BLAST data
read_blast <- function(file) {
  df <- read.delim(file, sep = "\t", header = FALSE, stringsAsFactors = FALSE, comment.char = "#")
  colnames(df) <- c("qaccver","saccver","pident","length","mismatches","gapopen",
                    "qstart","qend","sstart","send","evalue","bitscore",
                    "qcovs","qlen","slen","sstrand","stitle")
  df
}
blast_files <- list.files(path = ".", pattern = "\\.o7$", full.names = TRUE)
all_blast <- do.call(rbind, lapply(blast_files, read_blast))
ribbon_data <- subset(all_blast, length >= 100)

# Read gene annotation data; to keep the plot uncluttered, keep only the 5 longest genes per sequence
# (the |> pipe needs R >= 4.1 and the `by` argument needs dplyr >= 1.1.0)
gene_data <- read.delim("gene_track.tsv", sep = "\t", stringsAsFactors = FALSE) |>
  dplyr::slice_max(order_by = end - start, n = 5, by = seq_id)
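
When a pair of sequences shares no hits, the outfmt 7 file contains only comment lines, and read.delim() stops with an error on such input. The sketch below is one possible way to skip those files; `read_blast_safe` and `blast_cols` are hypothetical names, and the temporary files stand in for real BLAST output.

```r
# Column names matching the outfmt 7 fields used in this README
blast_cols <- c("qaccver", "saccver", "pident", "length", "mismatches", "gapopen",
                "qstart", "qend", "sstart", "send", "evalue", "bitscore",
                "qcovs", "qlen", "slen", "sstrand", "stitle")

# Hypothetical variant of read_blast(): returns NULL when a file has no hit lines
read_blast_safe <- function(file) {
  lines <- readLines(file)
  lines <- lines[nzchar(lines) & !grepl("^#", lines)]   # drop comments and blanks
  if (length(lines) == 0) return(NULL)
  df <- read.delim(text = lines, sep = "\t", header = FALSE,
                   stringsAsFactors = FALSE, quote = "")
  colnames(df) <- blast_cols
  df
}

# Toy files (made up for illustration): one with a single hit, one without hits
hit <- paste(c("MT108731.1", "MT118296.1", "95.5", "250", "10", "1", "1", "250",
               "300", "549", "1e-100", "400", "1", "64323", "32090", "plus",
               "toy title"), collapse = "\t")
f_hits <- tempfile(fileext = ".o7")
f_empty <- tempfile(fileext = ".o7")
writeLines(c("# BLASTN", "# 1 hits found", hit), f_hits)
writeLines(c("# BLASTN", "# 0 hits found"), f_empty)

tables <- lapply(c(f_hits, f_empty), read_blast_safe)
all_blast_safe <- do.call(rbind, Filter(Negate(is.null), tables))
```

Setting `quote = ""` keeps quotation marks inside sequence titles from being interpreted as field delimiters.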

Passing Only Essential seq_data

Sequence data (seq_data) is the only input ggchord requires. By default, sequences are arranged counterclockwise in the order in which they appear in seq_data; both the order and the appearance can be changed.

part1_1 <- ggchord(
  seq_data = seq_data
)

In the following example, seq_order, seq_orientation, and seq_curvature control the order, orientation, and curvature of the sequences, and seq_colors sets their colors.

part1_2 <- ggchord(
  seq_data = seq_data,
  seq_order = c("MT118296.1", "OR222515.1", "MT108731.1", "OQ646790.1"),
  seq_orientation = c(1,-1,1,-1),
  seq_curvature = c(0,2,-2,6),
  seq_colors = c("steelblue", "orange", "pink", "yellow")
)

Adding Sequence Alignment Data

Alignments are the main focus of an alignment chord diagram, so ribbon_data is the most important input after seq_data. By default, the fill color of each ribbon is determined by the percent identity in the BLAST results.

part2_1 <- ggchord(
  seq_data = seq_data,
  ribbon_data = ribbon_data
)

The coloring can be changed. For example, coloring ribbons by query sequence makes it easier to tell alignments from different sequences apart.

part2_2 <- ggchord(
  seq_data = seq_data,
  ribbon_data = ribbon_data,
  ribbon_color_scheme = "query"
)

If color carries no information for your figure, you can use a single color for all ribbons.

part2_3 <- ggchord(
  seq_data = seq_data,
  ribbon_data = ribbon_data,
  ribbon_color_scheme = "single",
  ribbon_colors = "orange"
)

Ribbons automatically adjust to the sequence orientation, curvature, gap, and radius settings; the same applies to axes and gene arrows.

Note: in the current version, certain parameter combinations may distort the image. This is planned to be fixed in a future version.

part2_4 <- ggchord(
  seq_data = seq_data,
  ribbon_data = ribbon_data,
  seq_orientation = c(1,-1,1,-1),
  seq_curvature = c(0,2,-2,6),
  seq_gap = c(.1,.05,.09,.05),
  seq_radius = c(1,5,1,1)
)

Adding Gene Annotation Informa
