CUDASW++4.0: Ultra-fast GPU-based Smith-Waterman Protein Sequence Database Search

Software requirements

Linux operating system with compatible CUDA Toolkit 12.2 or newer
C++17 compiler
zlib
make

Hardware requirements

A modern CUDA-capable GPU of the Ampere generation or newer. We have tested CUDASW4 on Ampere (sm_80), Ada Lovelace (sm_89), and Hopper (sm_90). Older generations lack hardware support for specific instructions and may run at reduced speed or not at all.
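
As a rough pre-build check, the compute capability of each installed GPU can be compared against the sm_80 minimum stated above. The sketch below is an illustration and not part of CUDASW4: the helper name is_supported_cc is hypothetical, and it assumes that nvidia-smi is installed and supports the compute_cap query field, which depends on the driver version.

```shell
# Hypothetical helper (not part of CUDASW4): succeeds if a compute capability
# string such as "8.6" has a major version of at least 8 (sm_80, Ampere).
is_supported_cc() {
    cc_major=${1%%.*}
    [ "$cc_major" -ge 8 ]
}

# Assumption: nvidia-smi is installed and supports the compute_cap query field.
# Each output line is expected to look like "0, 8.6" (GPU index, capability).
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=index,compute_cap --format=csv,noheader |
    while IFS=', ' read -r idx cc; do
        if is_supported_cc "$cc"; then
            echo "GPU $idx (compute capability $cc): supported"
        else
            echo "GPU $idx (compute capability $cc): older than Ampere"
        fi
    done
fi
```

The reported GPU indices can then be used in CUDA_VISIBLE_DEVICES to restrict the build and the search to supported devices.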

Download

git clone https://github.com/asbschmidt/CUDASW4.git

Build

Our software has two components, makedb and align. makedb constructs a database, which align can then query.

The build step compiles the GPU code for the architectures of all GPUs detected in the system. The CUDA environment variable CUDA_VISIBLE_DEVICES controls which GPUs are detected. If CUDA_VISIBLE_DEVICES is not set, all GPUs in the system are used.

Build makedb: make makedb
Build align: make align
Build align for the GPU architecture of GPUs 0 and 1: CUDA_VISIBLE_DEVICES=0,1 make align

Database construction

Use makedb to create a database from a FASTA file. The file can be gzip-compressed. FASTA files with up to 2 billion sequences are supported.

mkdir -p dbfolder
./makedb input.fa(.gz) dbfolder/dbname [options]

Options:

--mem val : Memory limit. Can use suffix K,M,G. If makedb requires more memory, it writes temporary files to the temp directory. Default: all available memory.
--tempdir val : Directory for temporary files. Must exist. Default: the db output directory.
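
Because both the output folder and the temp directory must exist before makedb runs, a small wrapper can create them first. This is a minimal sketch, not part of CUDASW4: the function name build_db, the MAKEDB override variable, and the tmp subfolder are hypothetical choices made for illustration. Only the makedb arguments themselves come from the option list above.

```shell
# Hypothetical wrapper (not part of CUDASW4). MAKEDB may point at the makedb
# binary; it defaults to ./makedb in the current directory.
build_db() {
    input=$1     # FASTA file, optionally gzip-compressed
    db=$2        # output prefix, e.g. dbfolder/dbname
    mem=$3       # memory limit with suffix, e.g. 16G
    tmp="$(dirname "$db")/tmp"
    # Create the output folder and the temp directory before calling makedb.
    mkdir -p "$(dirname "$db")" "$tmp"
    "${MAKEDB:-./makedb}" "$input" "$db" --mem "$mem" --tempdir "$tmp"
}

# Example call (commented out so the sketch has no side effects):
# build_db input.fa.gz dbfolder/dbname 16G
```

Placing the temp directory on a fast local disk is usually the reason to set --tempdir explicitly. The wrapper keeps it next to the database only for simplicity.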

Querying the database

Use align to query the database. align has two mandatory arguments.

--query The query file which contains all queries
--db The path to the reference database constructed with makedb.

Run ./align --help to get a complete list of options.

By default, the results will be output to stdout in plain text. Results can be output to file instead (--of filename), and can be output as tab-separated values (--tsv). Example tsv output is given below.

| Query number | Query length | Query header | Result number | Result score | Reference length | Reference header | Reference ID in DB |
|------------|------------|------------|------------|------------|------------|------------|------------|
| 0 | 144 | gi\|122087146 | 0 | 541 | 148 | UniRef50_P02233 | 23128215 |
| 0 | 144 | gi\|122087146 | 1 | 444 | 144 | UniRef50_P02238 | 22381647 |
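
Results written with --tsv can be post-processed with standard tools. The following sketch filters hits by score. It assumes the output is tab-delimited with the eight columns in the order shown in the table above, and that any header line does not start with a number. The function name filter_hits and the file name results.tsv are hypothetical.

```shell
# Hypothetical helper: read align TSV output on stdin and print
# "query header <TAB> reference header <TAB> score" for every hit whose
# score (column 5) is at least the threshold given as the first argument.
filter_hits() {
    awk -F'\t' -v min="$1" '
        # Skip any line whose first column is not a query number.
        $1 ~ /^[0-9]+$/ && $5 >= min { print $3 "\t" $7 "\t" $5 }
    '
}

# Example call (commented out; assumes results.tsv was written by align
# with the options --tsv --of results.tsv):
# filter_hits 500 < results.tsv
```

Applied to the two example rows above with a threshold of 500, only the hit against UniRef50_P02233 (score 541) is kept.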

Selecting GPUs

Similar to the build process, align will use all GPUs that are set with CUDA_VISIBLE_DEVICES, or all GPUs if CUDA_VISIBLE_DEVICES is not set.

# use the gpus that are currently set in CUDA_VISIBLE_DEVICES
./align --query queries.fa(.gz) --db dbfolder/dbname

# use gpus 0 and 1 for only this command
CUDA_VISIBLE_DEVICES=0,1 ./align --query queries.fa(.gz) --db dbfolder/dbname

Scoring options

    --top val : Output the val best scores. Default val = 10.
    --mat val : Set substitution matrix. Supported values: blosum45, blosum50, blosum62, blosum80. Default val = blosum62.
    --gop val : Gap open score. Overwrites blosum-dependent default score.
    --gex val : Gap extend score. Overwrites blosum-dependent default score.

The default gap scores are listed in the following table.

| | blosum45 | blosum50 | blosum62 | blosum80 |
|------------|----------|----------|----------|----------|
| gap_open | -13 | -13 | -11 | -10 |
| gap_extend | -2 | -2 | -1 | -1 |
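
For scripts that log the effective scoring parameters, the table above can be encoded as a small lookup. This is a sketch that only mirrors the documented defaults. The function name default_gap_scores is hypothetical, and align applies these defaults itself when --gop and --gex are omitted.

```shell
# Hypothetical helper: print "gap_open gap_extend" for a substitution matrix,
# using the default values from the table above.
default_gap_scores() {
    case $1 in
        blosum45|blosum50) echo "-13 -2" ;;
        blosum62)          echo "-11 -1" ;;
        blosum80)          echo "-10 -1" ;;
        *) echo "unsupported matrix: $1" >&2; return 1 ;;
    esac
}

# Example: record the defaults that apply to a run with --mat blosum80.
echo "blosum80 defaults (gap_open gap_extend): $(default_gap_scores blosum80)"
```

Whether --gop and --gex expect values with the sign shown in the table is not stated here, so check ./align --help before overriding them.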

Memory options

    --maxGpuMem val : Try not to use more than val bytes of gpu memory per gpu. This is not a hard limit. Can use suffix K,M,G. All available gpu memory by default.
    --maxTempBytes val : Size of temp storage in GPU memory. Can use suffix K,M,G. Default val = 4G
    --maxBatchBytes val : Process DB in batches of at most val bytes. Can use suffix K,M,G. Default val = 128M
    --maxBatchSequences val : Process DB in batches of at most val sequences. Default val = 10000000

Depending on the database size and the total available GPU memory, the database is either transferred to the GPUs once for all queries, or it is processed in batches, which requires a new transfer for each query. The options above give some control over memory usage. For best performance, the complete database must fit into maxGpuMem times the number of GPUs used.
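
The fit condition above is simple arithmetic: database size <= maxGpuMem x number of GPUs. The sketch below works through it. It assumes that the K, M, and G suffixes are binary multiples (1024-based), which is not confirmed here, and that the database size in bytes is already known. The function names parse_size and db_fits are hypothetical, and the on-disk size of a database may differ from its footprint in GPU memory.

```shell
# Hypothetical helper: convert a size such as 128M or 4G to bytes.
# Assumption: suffixes are 1024-based.
parse_size() {
    num=${1%[KMGkmg]}
    case $1 in
        *[Kk]) echo $((num * 1024)) ;;
        *[Mm]) echo $((num * 1024 * 1024)) ;;
        *[Gg]) echo $((num * 1024 * 1024 * 1024)) ;;
        *)     echo "$num" ;;
    esac
}

# Hypothetical helper: succeed if db_bytes <= maxGpuMem * num_gpus.
# Arguments: db size in bytes, maxGpuMem value (with suffix), number of GPUs.
db_fits() {
    [ "$1" -le $(( $(parse_size "$2") * $3 )) ]
}

# Example: a 30 GB database with --maxGpuMem 16G on one or two GPUs.
for gpus in 1 2; do
    if db_fits 30000000000 16G "$gpus"; then
        echo "$gpus GPU(s): database fits, transferred once"
    else
        echo "$gpus GPU(s): database does not fit, processed in batches"
    fi
done
```

Under these assumptions, the example database fits on two GPUs (2 x 16G = 34359738368 bytes) but not on one.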

Other options

    --dpx : Use DPX instructions. Hardware support requires Hopper (sm_90) or newer. Older GPUs fall back to software emulation.
    --verbose : More console output. Shows timings.
    --printLengthPartitions : Print number of sequences per length partition in db.
    --interactive : Loads DB, then waits for sequence input by user
    --help : Print all options

Benchmark scripts

This repository includes some of our benchmark scripts. Benchmark results are written to files. The benchmark scripts use the file allqueries.fasta as query input. Reference sequences are downloaded by the scripts.

Peak performance benchmark

Aligns file allqueries.fasta to a simulated database with equal sequences.

./runpeakbenchmark.sh kerneltype

kerneltype 0: half2, kerneltype 1: dpx_s16, kerneltype 2: dpx_s32, kerneltype 3: float

uniprot sprot benchmark

Downloads https://ftp.expasy.org/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz (88 MB) to folder benchmarkdbs, then constructs the corresponding DB and queries it with allqueries.fasta.

./runsprotbenchmark.sh kerneltype

kerneltype 0: half2, kerneltype 1: dpx

uniref50 benchmark

Downloads https://ftp.expasy.org/databases/uniprot/current_release/uniref/uniref50/uniref50.fasta.gz (12 GB) to folder benchmarkdbs, then constructs the corresponding DB and queries it with allqueries.fasta.

./rununiref50benchmark.sh kerneltype

kerneltype 0: half2, kerneltype 1: dpx

uniprot trembl benchmark

Downloads https://ftp.expasy.org/databases/uniprot/current_release/knowledgebase/complete/uniprot_trembl.fasta.gz (57 GB) to folder benchmarkdbs, then constructs the corresponding DB and queries it with allqueries.fasta.

./runtremblbenchmark.sh kerneltype

kerneltype 0: half2, kerneltype 1: dpx

Publication

This work is presented in the following paper.

Schmidt, B., Kallenborn, F., Chacon, A. et al. CUDASW++4.0: ultra-fast GPU-based Smith–Waterman protein sequence database search. BMC Bioinformatics 25, 342 (2024). https://doi.org/10.1186/s12859-024-05965-6
