SR2C

SR2C: A Novel Structurally Redundant Short Reads Collapser for Optimizing Genomic Sequencing Data Compression

<h1> <img src="https://github.com/fahaihi/SR2C/blob/master/SR2C_LOGO.png" alt="Logo" width="653" height="163"> Structurally Redundant Short Reads Collapser </h1>
A deduplicator for structurally redundant short reads in DNA sequencing data, based on cyclic hash linkage
<a href="#about-the-sr2c">About The SR2C</a> • <a href="#useage">Usage</a> • <a href="#dataset-acquisition">Dataset Acquisition</a> • <a href="#aknowledgements">Acknowledgements</a> • <a href="#additional-information">Additional Information</a>

About The SR2C

SR2C (Structurally Redundant Short Reads Collapser) is a deduplicator for short sequencing reads, built on the Cycle-Hash-Linkage data structure. It removes structurally redundant sequences from high-throughput genome sequencing data, including Direct Repeats (DRs), Mirror Repeats (MRs), Inverted Repeats (IRs), Paired Repeats (PRs), and Complementary Palindrome Repeats (CPRs). The current version applies SR2C to optimize data compression.
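
As background for the repeat classes listed above: in common genomics usage, a mirror repeat is a reversed copy of a sequence and an inverted repeat is a reverse-complemented copy. The shell sketch below only illustrates those string transforms. The function names are hypothetical and not part of SR2C, and the exact definitions SR2C uses for MRs, IRs, PRs, and CPRs are not stated here, so this should not be read as the tool's own logic.

```shell
# Hypothetical helpers for illustration; none of these are SR2C commands.

# Reverse a read (the transform behind a mirror repeat in common usage).
reverse_read() {
    printf '%s\n' "$1" | rev
}

# Complement each base: A<->T, C<->G.
complement_read() {
    printf '%s\n' "$1" | tr 'ACGT' 'TGCA'
}

# Reverse complement (the transform behind an inverted repeat in common usage).
reverse_complement_read() {
    printf '%s\n' "$1" | rev | tr 'ACGT' 'TGCA'
}

reverse_read AACG              # prints GCAA
complement_read AACG           # prints TTGC
reverse_complement_read AACG   # prints CGTT
```

A read that equals one of these transforms of an earlier read carries no new sequence information, which is the intuition behind collapsing it before compression.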

Usage

Clone the SR2C project from GitHub and compile it:

git clone https://github.com/fahaihi/SR2C.git
cd SR2C
chmod +x install.sh
./install.sh

The SR2C command-line tool is used as follows:

Usage:
Deduplication:
  ./SR2C -d [Save-Dir] -f [FastQ-File-Name]  -t [Threads-Num]
Recover:
  ./SR2C -r [Save-Dir] -t [Threads-Num]
Verify:
  ./SR2C -v [Save-Dir] -f [FastQ-File-Name]
Help (print this message)
  ./SR2C -h
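
The three modes above can be chained into a single round trip: deduplicate, recover, then verify. The following is a minimal sketch of such a wrapper. The names `sr2c_roundtrip`, `SR2C_BIN`, and `DRY_RUN` are hypothetical and not SR2C options; only the `-d`, `-r`, `-v`, `-f`, and `-t` flags come from the usage text. It also assumes that `./SR2C` returns a non-zero exit status on failure, which has not been confirmed.

```shell
# Hypothetical wrapper; sr2c_roundtrip, SR2C_BIN, and DRY_RUN are illustrative
# names, not SR2C options.
sr2c_roundtrip() {
    save_dir=$1
    fastq=$2
    threads=${3:-2}
    bin=${SR2C_BIN:-./SR2C}

    # With DRY_RUN=1 the commands are printed instead of executed.
    run() {
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "$*"
        else
            "$@"
        fi
    }

    # Assumption: each step exits non-zero on failure, so && stops the chain.
    run "$bin" -d "$save_dir" -f "$fastq" -t "$threads" &&
        run "$bin" -r "$save_dir" -t "$threads" &&
        run "$bin" -v "$save_dir" -f "$fastq"
}

# Print the three commands without running them.
DRY_RUN=1 sr2c_roundtrip test "$(pwd)/data/test.fastq" 2
```

Because the exit-status behaviour is an assumption, it is still worth reading the "Unable to recover sequences" count printed by the verify step.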

An example:

A: To remove structural redundancy from the data/test.fastq file using 2 threads and save the result in the test directory, use the following command:

data="$(pwd)/data/test.fastq"
./SR2C -d test -f "${data}" -t 2

Results:

FileName:/public/home/jd_sunhui/genCompressor/SR2C/data/test.fastq
SaveDIR: /public/home/jd_sunhui/genCompressor/SR2C/test
ReadLen: 80
Threads: 2
STEP1:Begin Cycle-HASH-Linkage~
   RLength:80
   FileName:/public/home/jd_sunhui/genCompressor/SR2C/data/test.fastq
   SaveFileName:/public/home/jd_sunhui/genCompressor/SR2C/test
   Thread:2
   QualityScoreFlag:0
   HeaderFlag:0
STEP2:Begin Load Data~
   Data.size(): 1500
   Load_Data(*) running over~
STEP3:Begin Build CHL~
   LOG:1500/1500 --> 100%
   Build_CHL(*) running over~
STEP4:Begin Files SAVING~
  File_Name_reads:/public/home/jd_sunhui/genCompressor/SR2C/test/reads.txt
   File_Name_count:/public/home/jd_sunhui/genCompressor/SR2C/test/count.txt
   File_Name_id_1:/public/home/jd_sunhui/genCompressor/SR2C/test/id_1.txt
   File_Name_id_and_pos:/public/home/jd_sunhui/genCompressor/SR2C/test/id_pos.txt
   File_Name_info:/public/home/jd_sunhui/genCompressor/SR2C/test/info.txt
   FatherNum : 971
   Func_File_Saving(*) over~
STEP5:End Cycle-HASH-Linkage~

B: To recover the structurally redundant sequences from the test directory using 4 threads, use the following command:

./SR2C -r test -t 4

Results:

SaveDIR: /public/home/jd_sunhui/genCompressor/SR2C/test
Threads: 4
STEP1:Get Parameter over~
   ReadsNum:1500
   FatherNum:971
   RLength:80
   InputDir:/public/home/jd_sunhui/genCompressor/SR2C/test
STEP2:Load File over~
STEP3:Paralle Recover Row Data.
   CPU Cores:4
   LOG:971/971 --> 100%
STEP5:Save File Over.
   OutputSavedPath:/public/home/jd_sunhui/genCompressor/SR2C/test/recover.txt

Note: If the error message "terminate called after throwing an instance of 'std::invalid_argument'" appears, run ./install.sh again and redo the deduplication step (A).

C: To verify that the original sequencing reads are recovered losslessly, use the following command:

./SR2C -v test -f ${data}

Results:

FileA:/public/home/jd_sunhui/genCompressor/SR2C/data/test.fastq
FileB:/public/home/jd_sunhui/genCompressor/SR2C/test/recover.txt
Unable to recover sequences：0
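
For an independent cross-check of the verify step, the read sequences in the FASTQ file can be compared with the recovered file using standard tools. The sketch below rests on assumptions that the output above does not confirm: that recover.txt holds one read sequence per line, that the FASTQ file uses four-line records, and that read order may differ (hence the sort). The function name `compare_reads` is hypothetical.

```shell
# Hypothetical helper, not part of SR2C.
# Assumptions: four-line FASTQ records; the recovered file has one read per
# line; read order may differ between the two files.
compare_reads() {
    fastq=$1
    recovered=$2
    expected_sorted=$(mktemp) || return 2
    actual_sorted=$(mktemp) || { rm -f "$expected_sorted"; return 2; }

    # In a four-line FASTQ record, the sequence is the second line.
    awk 'NR % 4 == 2' "$fastq" | sort > "$expected_sorted"
    sort "$recovered" > "$actual_sorted"

    # Exit status 0 means the two multisets of reads are identical.
    if cmp -s "$expected_sorted" "$actual_sorted"; then rc=0; else rc=1; fi
    rm -f "$expected_sorted" "$actual_sorted"
    return "$rc"
}

# Example: compare_reads "$data" test/recover.txt && echo "reads match"
```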

Note: To measure running time and memory usage, wrap the command as follows:

/bin/time -v -p [your command]
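
When the verbose report is saved to a file, the peak memory figure can be pulled out for comparison across runs. This is a small sketch assuming GNU time, whose `-v` report includes a "Maximum resident set size (kbytes)" line and which accepts `-o` to write the report to a file. The helper name `peak_rss_kb` and the log file name are hypothetical, and the path to GNU time may differ on your system.

```shell
# Hypothetical helper: print the peak resident set size (in KB) recorded in a
# GNU time -v report.
peak_rss_kb() {
    awk -F': ' '/Maximum resident set size/ { print $2 }' "$1"
}

# Example (assumes GNU time is installed at /bin/time):
#   /bin/time -v -o time.log ./SR2C -d test -f "$data" -t 2
#   peak_rss_kb time.log
```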

Dataset Acquisition

The experiments used 8 real, publicly available sequencing datasets from the NCBI database (https://www.ncbi.nlm.nih.gov): SRR8386204_2, SRR11994956, SRR17794741_1, SRR17794741_2, SRR8386204_1, SRR13556216_1, SRR16553126_1, and SRR11995278. Together, these datasets contain 105,016,192 reads and occupy 25,607,473 KB. Detailed descriptions of the datasets are as follows:

Table1

The experimental datasets were downloaded using the sra-tools package; setup instructions can be found at https://github.com/ncbi/sra-tools. The download scripts are as follows:

dataset-1: C.arietinum (chickpea) URL: https://www.ebi.ac.uk/ena/browser/view/SRR13556216

cd SR2C/data
prefetch SRR13556216
fastq-dump --split-files SRR13556216
rm -rf SRR13556216 SRR13556216_2.fastq

dataset-2: Human (human metagenome) URL: https://www.ebi.ac.uk/ena/browser/view/SRR16553126

prefetch SRR16553126
fastq-dump --split-files SRR16553126
rm -rf SRR16553126 SRR16553126_2.fastq

dataset-3&4: M.fascicularis (crab-eating macaque) URL: https://www.ebi.ac.uk/ena/browser/view/SRR8386204

prefetch SRR8386204
fastq-dump --split-files SRR8386204
rm -rf SRR8386204

dataset-5&6: Mouse.tumor (mouse tumor) URL: https://www.ebi.ac.uk/ena/browser/view/SRR17794741

prefetch SRR17794741
fastq-dump --split-files SRR17794741
rm -rf SRR17794741

dataset-7: S.fontinalis-1 (brook trout) URL: https://www.ebi.ac.uk/ena/browser/view/SRR11995278

prefetch SRR11995278
fastq-dump SRR11995278
rm -rf SRR11995278

dataset-8: S.fontinalis-2 (brook trout) URL: https://www.ebi.ac.uk/ena/browser/view/SRR11994956

prefetch SRR11994956
fastq-dump SRR11994956
rm -rf SRR11994956

Acknowledgements

Thanks to @HPC-GXU for the computing device support.
Thanks to @NCBI for all available datasets.
Thanks to @PIGZ-Project for Pigz source code.
Thanks to @PBZIP2-Project for PBzip2 source code.
Thanks to @XZ-Project for XZ source code.
Thanks to @7ZProject for 7Z source code.
Thanks to @Minirmd for Minirmd source code.

Additional Information

Version: V1.2023.01.24.

Authors: NBJL-BioGroup.

ContactUS: https://nbjl.nankai.edu.cn OR sunh@nbjl.naikai.edu.cn

Supplementary: https://github.com/fahaihi/SR2C/blob/master/Supplementary.pdf
