CSBFinder
A Java desktop application with a graphical user interface for the discovery of colinear syntenic blocks across thousands of microbial genomes
CSBFinder-S
- Overview
- Prerequisites
- Running CSBFinder-S
- Input files formats
- Output files
- Example of running CSBFinder-S
- User interface features
- License
- Author
- Credit
<a name='overview'>Overview</a>
CSBFinder-S is a standalone desktop Java application with a graphical user interface that can also be executed via the command line.
TL;DR: Watch this video from the ISMB conference to understand what CSBFinder is all about, including some examples on real data.
CSBFinder-S implements a novel methodology for the discovery and ranking of colinear syntenic blocks (CSBs): groups of genes that are consistently located close to each other, in the same order, across a wide range of taxa. CSBFinder-S incorporates efficient algorithms that identify CSBs in large genomic datasets. The discovered CSBs are ranked according to a probabilistic score and clustered into families according to their gene content similarity.
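To make the idea of grouping CSBs into families by gene content more concrete, here is a minimal Python sketch. It is not the tool's implementation: the text above does not say which similarity measure or clustering procedure CSBFinder-S uses, so the Jaccard index, the greedy assignment, the `threshold` value, and all function names are assumptions chosen for illustration. The probabilistic scores are taken as given inputs.

```python
def jaccard(a, b):
    """Jaccard similarity of the gene sets of two non-empty patterns (an assumed measure)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


def cluster_into_families(scored_csbs, threshold=0.5):
    """Greedily group (pattern, score) pairs into families.

    CSBs are visited from highest to lowest score. Each CSB joins the first
    family whose top-scoring member is similar enough; otherwise it starts a
    new family. A family's score is the score of its top-scoring member.
    """
    families = []
    for pattern, score in sorted(scored_csbs, key=lambda item: item[1], reverse=True):
        for family in families:
            representative = family["members"][0][0]
            if jaccard(pattern, representative) >= threshold:
                family["members"].append((pattern, score))
                break
        else:
            families.append({"score": score, "members": [(pattern, score)]})
    return families


# Hypothetical CSBs with made-up scores.
csbs = [(("A", "B", "C"), 10.0), (("A", "B"), 7.0), (("X", "Y"), 8.0)]
families = cluster_into_families(csbs)
print(len(families))  # 2
```

Because CSBs are processed in descending score order, the first member of each family is always its highest-scoring CSB, so the family score never needs updating.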
The overall toolkit includes two components, implementing two distinct algorithms and released in separate versions. The first, denoted CSBFinder (Svetlitsky et al., 2018, cited below), incorporated a suffix-tree based algorithm and was optimized to seek single-operon CSBs. The second version, CSBFinder-S (Svetlitsky et al., 2020, cited below), generalizes the tool to cross-strand, multi-operon CSBs and incorporates a match-point arithmetic based algorithm to efficiently support the generalizations.
March 27, 2019 update
CSBFinder-S for the discovery of cross-strand multi-operon CSBs is released
- In this version, the user can decide whether to segment the input genomes into directons (consecutive genes on the same strand).
- A novel exact algorithm that uses match-point arithmetic is proposed and implemented. The time and space complexities of the algorithm are insensitive to the number of insertions and maximal CSB length. The new algorithm is faster than the algorithm given in (Svetlitsky et al., 2018) for larger values of insertions allowed. Additional advantages of the new algorithm are its simplicity of implementation, and the fact that it is easily parallelizable, yielding further scalability.
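As an illustration of what segmenting a genome into directons means, the following Python sketch splits a gene sequence into maximal runs of consecutive genes on the same strand. It is not taken from the CSBFinder-S code base: the function name and the gene identifier format (an orthology group name followed by `+` or `-`) are assumptions made for this example.

```python
from itertools import groupby


def split_into_directons(genome):
    """Split a genome into directons: maximal runs of genes on the same strand.

    `genome` is a list of gene identifiers whose last character is the
    strand, '+' or '-' (an assumed format for this sketch).
    """
    return [list(run) for _, run in groupby(genome, key=lambda gene: gene[-1])]


genome = ["COG1+", "COG2+", "COG3-", "COG4-", "COG5+"]
print(split_into_directons(genome))
# [['COG1+', 'COG2+'], ['COG3-', 'COG4-'], ['COG5+']]
```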
CSBFinder-S provides several novel mechanisms to help the user sort, filter, and interpret the discovered CSBs.
- A ranking score that considers the genomic distances between the genomes in which the corresponding CSBs appear.
- The user can constrain the structural features of the desired CSBs (length, abundance, etc.), as well as extract CSBs confined to specific functional semantic categories.
- A taxonomic viewer of the genomes that contain instances of each CSB.
- Many other improvements have been incorporated in the user interface.
Workflow Description
The workflow of CSBFinder-S is given in the figure below.
(A) The input to the workflow is a dataset of input genomes, where each genome is modeled as a sequence of gene identifiers: A gene identifier indicates the corresponding gene orthology group as well as the strand (+/-) in which the gene is encoded. Additional input consists of user-specified parameters k (number of allowed insertions) and q (the quorum parameter). In our formulation, a CSB is a pattern that appears as a substring of at least one of the input genomes, and has instances in at least q of the input genomes, where each instance may vary from the CSB pattern by at most k gene insertions.
(B) The genomes are mined to identify all patterns that qualify as CSBs according to the user-specified parameters.
(C) All discovered CSBs are ranked according to a probabilistic score.
(D) The CSBs are clustered into families according to their gene content similarity, and the rank of a family is determined by the score of its highest scoring CSB.
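The CSB definition in step (A) can be checked by brute force, as in the Python sketch below. This is only a reading of the definition, not the suffix-tree or match-point algorithm used by the tool, and it ignores details such as reverse-strand instances. All function names and the toy genomes are hypothetical.

```python
def is_subsequence(pattern, window):
    """True if the genes of `pattern` appear in `window` in the same order."""
    remaining = iter(window)
    return all(gene in remaining for gene in pattern)


def has_instance(pattern, genome, k):
    """True if `genome` has an instance of `pattern` with at most k insertions.

    An instance fits inside a stretch of at most len(pattern) + k
    consecutive genes, so every window of that length is tested.
    """
    span = len(pattern) + k
    return any(
        is_subsequence(pattern, genome[start:start + span])
        for start in range(len(genome))
    )


def is_csb(pattern, genomes, k, q):
    """Check both conditions: an exact occurrence somewhere, and instances in >= q genomes."""
    appears_exactly = any(has_instance(pattern, g, 0) for g in genomes.values())
    support = sum(has_instance(pattern, g, k) for g in genomes.values())
    return appears_exactly and support >= q


genomes = {
    "g1": ["A+", "B+", "C+"],
    "g2": ["A+", "X+", "B+", "C+"],
    "g3": ["A+", "X+", "Y+", "B+", "C+"],
    "g4": ["C+", "B+", "A+"],
}
pattern = ["A+", "B+", "C+"]
print(is_csb(pattern, genomes, k=1, q=2))  # True: instances in g1 and g2
print(is_csb(pattern, genomes, k=1, q=3))  # False: g3 needs two insertions
```

Raising `k` to 2 admits the instance in `g3`, which shows how the insertion parameter trades strictness for sensitivity.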

Citation
The following paper contains details regarding the first version of CSBFinder-S, denoted CSBFinder, that targeted the extraction of CSBs that correspond to operons. It contains details of the Suffix-Tree based algorithm for CSB extraction. The options to use the Suffix-Tree based algorithm, and the extraction of directon CSBs, are still available in the new CSBFinder-S tool.
If you used the tool as part of your research, please cite us:
When searching for cross-strand colinear syntenic blocks:
Dina Svetlitsky, Tal Dagan, Michal Ziv-Ukelson, Discovery of multi-operon colinear syntenic blocks in microbial genomes, Bioinformatics, Volume 36, Issue Supplement_1, July 2020, Pages i21–i29, https://doi.org/10.1093/bioinformatics/btaa503
When searching for colinear syntenic blocks that are conserved on the same strand:
Dina Svetlitsky, Tal Dagan, Vered Chalifa-Caspi, Michal Ziv-Ukelson, CSBFinder: discovery of colinear syntenic blocks across thousands of prokaryotic genomes, Bioinformatics, Volume 35, Issue 10, 15 May 2019, Pages 1634–1643, https://doi.org/10.1093/bioinformatics/bty861
<a name='prerequisites'>Prerequisites</a>
Java Runtime Environment (JRE) 8 or higher.
<a name='running'>Running CSBFinder-S</a>
<a name='download'>Download</a>
- Download the latest release of the CSBFinder-S installer.
- The available options are Windows (64-bit or 32-bit), Unix, and macOS.
CSBFinder-S has a user interface, but can be executed via the command line by executing the JAR file in the installation folder.
<a name='ui'>Running CSBFinder-S via User Interface</a>
Just double click on the CSBFinder-S executable file in the installation folder, or (if you checked these options during installation) from the Start Menu / Desktop.
Note: If you are going to use a very large input dataset, you might need to change the maximal memory that can be used by CSBFinder-S. Go to the installation folder and edit the file "CSBFinder-S.vmoptions" using a text editor. Change the Java option `-Xmx500m` to `-Xmx[maximal heap size]`, depending on the available RAM. For example, `-Xmx6g` sets the maximal Java heap size to 6GB.
It is recommended to use at least 6GB for a large dataset. You can specify a higher number, depending on your RAM size.
Importing input files
- Importing a file containing the input genomes:
  - Choose `File->Import->Genomes File`. If your dataset is large, this may take a few minutes.
    Sample input files are provided in the input directory in the installation folder.
  - The "Run" button should be enabled. Click on this button to set the parameters.
  - A progress bar appears. Hover over the question mark icon next to each parameter for an explanation of that parameter. After setting the parameters, click on "Run". This can take a few minutes, depending on the size of the dataset and on the parameters specified.
  - After the process is done, the lower panel will contain all the discovered CSBs.
- Importing a saved session file:
  If you have run CSBFinder-S and saved a session file, you can load it by choosing `File->Import->Session File`.
- Importing gene orthology group information (OPTIONAL):
  Load it by choosing `File->Import->Orthology Information file`. This information will be displayed on the lower right panel.
- Importing taxonomic information (OPTIONAL):
  Load it by choosing `File->Import->Taxonomy File`. This information will be displayed in the `Taxa View` tab in the upper panel.
- Importing additional metadata (OPTIONAL):
  Load it by choosing `File->Import->Genome Metadata File`. This information will be displayed in the `Taxa View` tab in the upper panel.
<a name='cmd'>Running CSBFinder-S via Command Line </a>
CSBFinder-S can be executed via the command line by executing the JAR file in the installation folder.
- In the terminal (Linux) or cmd (Windows), type:
      java -jar CSBFinder-S-[version]-jar-with-dependencies.jar [options]
  Note: If your input dataset is very large, add the argument `-Xmx6g` (6g might be enough, but you can specify a higher number, depending on your RAM size). For example:
      java -Xmx6g -jar CSBFinder-S-[version]-jar-with-dependencies.jar [options]
  Note: When running CSBFinder-S without command line arguments, the user interface will be launched.
Sample input files are provided below
Options:
Mandatory:
- -in INPUT_DATASET_FILE_NAME
  Input file relative or absolute path. See Input files formats for more details.
- -q QUORUM
  The quorum parameter. Minimal number of input sequences that must contain a CSB instance.
  Default: 1 Min Value: 1
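For scripted runs, the command line described above can be assembled programmatically. The Python sketch below builds the argument list using only the options documented here (`-in`, `-q`, and the JVM option `-Xmx`). The helper name `build_command` and the example input path are hypothetical, and the `[version]` placeholder in the JAR name must be replaced with the version you installed.

```python
import subprocess


def build_command(jar_path, input_path, quorum, max_heap=None):
    """Return the argument list for one command-line run of CSBFinder-S.

    `max_heap` is an optional JVM heap size such as "6g"; when given it is
    passed to Java as -Xmx<max_heap>, before the -jar argument.
    """
    if quorum < 1:
        raise ValueError("quorum must be at least 1")
    command = ["java"]
    if max_heap:
        command.append(f"-Xmx{max_heap}")
    command += ["-jar", jar_path, "-in", input_path, "-q", str(quorum)]
    return command


command = build_command(
    "CSBFinder-S-[version]-jar-with-dependencies.jar",  # replace [version]
    "path/to/genomes_file",                             # hypothetical path
    quorum=50,
    max_heap="6g",
)
print(" ".join(command))
# Uncomment to launch the run once the paths are real:
# subprocess.run(command, check=True)
```

Passing the arguments as a list rather than a single string avoids shell quoting problems when paths contain spaces.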
