CUDSS.jl: Julia interface for NVIDIA cuDSS
Overview
CUDSS.jl is a Julia interface to the NVIDIA cuDSS library. NVIDIA cuDSS provides three factorizations (LDU, LDLᵀ, LLᵀ) for solving sparse linear systems on GPUs.
Why CUDSS.jl?
Unlike other CUDA libraries that are commonly bundled together, cuDSS is currently in preview. For this reason, it is not included in CUDA.jl. To maintain consistency with the naming conventions used for other CUDA libraries (such as CUBLAS, CUSOLVER, CUSPARSE, etc.), we have named this interface CUDSS.jl.
Installation
CUDSS.jl can be installed and tested through the Julia package manager:
julia> ]
pkg> add CUDSS
pkg> test CUDSS
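For scripts or CI jobs where the interactive Pkg prompt is not available, the same two steps can be written with the functional Pkg API. This is a minimal sketch of an equivalent, not taken from the README; the test suite is expected to need a CUDA-capable GPU.

```julia
using Pkg

# Equivalent of `pkg> add CUDSS`: add the registered package to the active environment.
Pkg.add("CUDSS")

# Equivalent of `pkg> test CUDSS`: run the package's own test suite.
# Assumption: this needs a machine with a working NVIDIA GPU and driver.
Pkg.test("CUDSS")
```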
Content
CUDSS.jl exposes the functionality of NVIDIA cuDSS through a small set of types and routines.
It introduces the types CudssSolver and CudssBatchSolver along with four core routines: cudss, cudss_update, cudss_set, and cudss_get.
In addition, methods of cholesky, ldlt, lu and \ are specialized for the CuSparseMatrixCSR type.
In-place variants are also provided: cholesky!, ldlt!, lu! and ldiv!.
These variants reuse the symbolic factorization as well as the storage already allocated.
This avoids repeating the analysis phase when a sequence of sparse linear systems with the same sparsity pattern is solved on the GPU.
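As a rough illustration of the generic methods listed above, the sketch below factorizes a CuSparseMatrixCSR with lu, solves with \, then refreshes the factorization with lu! and solves again with ldiv!. It assumes a CUDA-capable GPU. The matrix, the 10I diagonal shift (chosen here only to keep the system well conditioned) and the variable names are illustrative, and the call shapes lu!(F, A_gpu) and ldiv!(x_gpu, F, c_gpu) are assumed to follow the usual LinearAlgebra conventions.

```julia
using CUDA, CUDA.CUSPARSE
using CUDSS
using SparseArrays, LinearAlgebra

T = Float64
n = 100

# Illustrative test matrix; the 10I shift is an assumption made to keep it well conditioned.
A_cpu = sprand(T, n, n, 0.05) + 10I
b_cpu = rand(T, n)

A_gpu = CuSparseMatrixCSR(A_cpu)
b_gpu = CuVector(b_cpu)

# Factorize the matrix, then solve A x = b.
F = lu(A_gpu)
x_gpu = F \ b_gpu
res1 = norm(b_gpu - A_gpu * x_gpu)

# Change the values of A while keeping its sparsity pattern
# (every diagonal entry is already stored because of the shift).
d_gpu = CuVector(rand(T, n))
A_gpu = A_gpu + Diagonal(d_gpu)

# Refactorize in place, reusing the symbolic factorization held by F.
lu!(F, A_gpu)

# Solve for a new right-hand side, writing the solution into the existing x_gpu.
c_gpu = CuVector(rand(T, n))
ldiv!(x_gpu, F, c_gpu)
res2 = norm(c_gpu - A_gpu * x_gpu)
```

The examples in the next section perform the same steps explicitly through CudssSolver and the cudss phases, which gives finer control over each phase.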
Examples
Example 1: Sparse unsymmetric linear system with one right-hand side
using CUDA, CUDA.CUSPARSE
using CUDSS
using SparseArrays, LinearAlgebra
T = Float64
n = 100
A_cpu = sprand(T, n, n, 0.05) + I
x_cpu = zeros(T, n)
b_cpu = rand(T, n)
A_gpu = CuSparseMatrixCSR(A_cpu)
x_gpu = CuVector(x_cpu)
b_gpu = CuVector(b_cpu)
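# "G" declares a general (unsymmetric) matrix; 'F' means the full matrix is stored
solver = CudssSolver(A_gpu, "G", 'F')
# Run the three phases in order: symbolic analysis, numerical factorization, solve
cudss("analysis", solver, x_gpu, b_gpu)
cudss("factorization", solver, x_gpu, b_gpu; asynchronous=false)
cudss("solve", solver, x_gpu, b_gpu; asynchronous=false)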
r_gpu = b_gpu - A_gpu * x_gpu
norm(r_gpu)
# In-place LU: update the values of A, then refactorize with the existing symbolic analysis
d_gpu = rand(T, n) |> CuVector
A_gpu = A_gpu + Diagonal(d_gpu)
cudss_update(solver, A_gpu)
c_cpu = rand(T, n)
c_gpu = CuVector(c_cpu)
cudss("refactorization", solver, x_gpu, c_gpu; asynchronous=false)
cudss("solve", solver, x_gpu, c_gpu; asynchronous=false)
r_gpu = c_gpu - A_gpu * x_gpu
norm(r_gpu)
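Example 1 exercises cudss and cudss_update; the two remaining core routines, cudss_set and cudss_get, configure the solver and query data from it. The following is a minimal sketch only. It assumes that the parameter names "ir_n_steps" and "lu_nnz" (lowercase forms of cuDSS's CUDSS_CONFIG_IR_N_STEPS and CUDSS_DATA_LU_NNZ) are accepted by the installed version, so the list of supported names should be checked in the CUDSS.jl documentation. The 10I shift is again an illustrative choice.

```julia
using CUDA, CUDA.CUSPARSE
using CUDSS
using SparseArrays, LinearAlgebra

T = Float64
n = 100
A_cpu = sprand(T, n, n, 0.05) + 10I   # illustrative, well-conditioned matrix
b_cpu = rand(T, n)

A_gpu = CuSparseMatrixCSR(A_cpu)
x_gpu = CuVector(zeros(T, n))
b_gpu = CuVector(b_cpu)

solver = CudssSolver(A_gpu, "G", 'F')

# Assumed parameter name: request one step of iterative refinement during the solve.
cudss_set(solver, "ir_n_steps", 1)

cudss("analysis", solver, x_gpu, b_gpu)
cudss("factorization", solver, x_gpu, b_gpu; asynchronous=false)
cudss("solve", solver, x_gpu, b_gpu; asynchronous=false)

# Assumed parameter name: number of nonzeros in the computed factors.
nnz_lu = cudss_get(solver, "lu_nnz")

res = norm(b_gpu - A_gpu * x_gpu)
```

Here the setting is applied before the phases run, and the query is made after the factorization so that the factors exist.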
Example 2: Sparse symmetric linear system with multiple right-hand sides
using CUDA, CUDA.CUSPARSE
using CUDSS
using SparseArrays, LinearAlgebra
T = Float64
R = real(T)
n = 100
p = 5
A_cpu = sprand(T, n, n, 0.05) + I
A_cpu = A_cpu + A_cpu'
X_cpu = zeros(T, n, p)
B_cpu = rand(T, n, p)
# Store only the lower triangle, matching the 'L' view passed to CudssSolver
A_gpu = CuSparseMatrixCSR(A_cpu |> tril)
X_gpu = CuMatrix(X_cpu)
B_gpu = CuMatrix(B_cpu)
structure = T <: Real ? "S" : "H"
solver = CudssSolver(A_gpu, structure, 'L')
cudss("analysis", solver, X_gpu, B_gpu)
cudss("factorization", solver, X_gpu, B_gpu; asynchronous=false)
cudss("solve", solver, X_gpu, B_gpu; asynchronous=false)
R_gpu = B_gpu - CuSparseMatrixCSR(A_cpu) * X_gpu
norm(R_gpu)
# In-place LDLᵀ
d_gpu = rand(R, n) |> CuVector
A_gpu = A_gpu + Diagonal(d_gpu)
cudss_update(solver, A_gpu)
C_cpu = rand(T, n, p)
C_gpu = CuMatrix(C_cpu)
cudss("refactorization", solver, X_gpu, C_gpu; asynchronous=false)
cudss("solve", solver, X_gpu, C_gpu; asynchronous=false)
R_gpu = C_gpu - ( CuSparseMatrixCSR(A_cpu) + Diagonal(d_gpu) ) * X_gpu
norm(R_gpu)
Example 3: Sparse Hermitian positive definite linear system with multiple right-hand sides
using CUDA, CUDA.CUSPARSE
using CUDSS
using SparseArrays, LinearAlgebra
T = ComplexF64
R = real(T)
n = 100
p = 5
A_cpu = sprand(T, n, n, 0.01)
A_cpu = A_cpu * A_cpu' + I
X_cpu = zeros(T, n, p)
B_cpu = rand(T, n, p)
# Store only the upper triangle, matching the 'U' view passed to CudssSolver
A_gpu = CuSparseMatrixCSR(A_cpu |> triu)
X_gpu = CuMatrix(X_cpu)
B_gpu = CuMatrix(B_cpu)
structure = T <: Real ? "SPD" : "HPD"
solver = CudssSolver(A_gpu, structure, 'U')
cudss("analysis", solver, X_gpu, B_gpu)
cudss("factorization", solver, X_gpu, B_gpu; asynchronous=false)
cudss("solve", solver, X_gpu, B_gpu; asynchronous=false)
R_gpu = B_gpu - CuSparseMatrixCSR(A_cpu) * X_gpu
norm(R_gpu)
# In-place LLᴴ
d_gpu = rand(R, n) |> CuVector
A_gpu = A_gpu + Diagonal(d_gpu)
cudss_update(solver, A_gpu)
C_cpu = rand(T, n, p)
C_gpu = CuMatrix(C_cpu)
cudss("refactorization", solver, X_gpu, C_gpu; asynchronous=false)
cudss("solve", solver, X_gpu, C_gpu; asynchronous=false)
R_gpu = C_gpu - ( CuSparseMatrixCSR(A_cpu) + Diagonal(d_gpu) ) * X_gpu
norm(R_gpu)