SIMDPirates.jl

Pirating base type NTuple{N,Core.VecElement{T}} -- although base methods are NOT overloaded -- and plundering the great work of eschnett's SIMD.jl


⚠️ DEPRECATED: SIMDPirates.jl is now deprecated in favor of VectorizationBase.jl

This library serves two primary purposes:

  1. Letting users write explicit SIMD code.
  2. Serving as a code-gen backend for other libraries, such as LoopVectorization.jl.

The second point has been a major driving factor behind the API divergence between SIMD.jl and SIMDPirates.jl. That is, code-gen is a lot easier if multiple dispatch does the heavy lifting so that the same code does the correct thing based on type information.

The major differences are with the vload and vstore! API. They use zero-based indexing, and the behavior is a function of the input types:

julia> using SIMDPirates

julia> A = rand(100,100); ptrA = stridedpointer(A); # WARNING: don't let A get garbage collected

julia> vload(ptrA, (0,0)), A[1,1]
(0.7977634555508373, 0.7977634555508373)

julia> vload(ptrA, (1,)), A[2]
(0.13579836748463214, 0.13579836748463214)

An index of type _MM{W}(i) represents a vector load or store covering W contiguous elements starting at (zero-based) position i.

julia> vload(ptrA, (_MM{8}(8),))
SVec{8,Float64}<0.6145530413966958, 0.13905050452534073, 0.8536024612786386, 0.13206059984056195, 0.5746515798950431, 0.035588186094294816, 0.9061808924885322, 0.0761514370503289>

julia> A[9:16,1]'
1×8 LinearAlgebra.Adjoint{Float64,Array{Float64,1}}:
 0.614553  0.139051  0.853602  0.132061  0.574652  0.0355882  0.906181  0.0761514

julia> vload(ptrA, (_MM{8}(16),2))
SVec{8,Float64}<0.9345847434764896, 0.8778295861820791, 0.3882306993294067, 0.029132949582947543, 0.13643548789260773, 0.22573385104528088, 0.16953827538934285, 0.09210510294056884>

julia> A[17:24,3]'
1×8 LinearAlgebra.Adjoint{Float64,Array{Float64,1}}:
 0.934585  0.87783  0.388231  0.0291329  0.136435  0.225734  0.169538  0.0921051

julia> vload(ptrA, (2,_MM{8}(24)))
SVec{8,Float64}<0.4586224341251526, 0.21030061931083033, 0.12676185033224674, 0.03418338751245442, 0.1415585905885226, 0.5599978264570737, 0.8694201302322504, 0.5101382821233793>

julia> A[3,25:32]'
1×8 LinearAlgebra.Adjoint{Float64,Array{Float64,1}}:
 0.458622  0.210301  0.126762  0.0341834  0.141559  0.559998  0.86942  0.510138

julia> vload(ptrA, (_MM{8}(24),_MM{8}(24)))
SVec{8,Float64}<0.41258101001567926, 0.7681445910047522, 0.49408560799133205, 0.8683185123046988, 0.0988985046194395, 0.382843770190751, 0.47204194244896036, 0.4655638468723473>

julia> getindex.(Ref(A), 25:32, 25:32)'
1×8 LinearAlgebra.Adjoint{Float64,Array{Float64,1}}:
 0.412581  0.768145  0.494086  0.868319  0.0988985  0.382844  0.472042  0.465564

You can also index using vectors:

julia> si = SVec(ntuple(i -> Core.VecElement(3i), Val(8)));

julia> vload(ptrA, (si,4))
SVec{8,Float64}<0.420209298966957, 0.09396816626228843, 0.4879807535620213, 0.7244630379947636, 0.7657242973977998, 0.37856664034180176, 0.14493820968814353, 0.26933496073958674>

julia> A[4:3:27,5]'
1×8 LinearAlgebra.Adjoint{Float64,Array{Float64,1}}:
 0.420209  0.0939682  0.487981  0.724463  0.765724  0.378567  0.144938  0.269335

The API for vstore!(::AbstractStridedPointer, v, i) is similar: the index determines the elements to which v is stored. If v is a scalar, it is stored to each of the implied elements. To store a vector at a scalar location, however, you should reduce it manually first, because whether you want sum, prod, or some other reduction is not assumed.
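
These semantics can be sketched in plain Julia (hypothetical helper names, not the SIMDPirates implementation): a vector index maps each lane to its own element, while a scalar value is broadcast across every implied element.

```julia
# Zero-based indices, as in the vload examples: a width-W index covers
# one-based positions i+1 through i+W.
function store!(A, v::NTuple{W,T}, i::Integer) where {W,T}
    for j in 1:W
        A[i + j] = v[j]   # each lane goes to its own element
    end
    A
end
function store!(A, v::Number, i::Integer, W::Integer)
    for j in 1:W
        A[i + j] = v      # a scalar is stored to every implied element
    end
    A
end

A = zeros(8)
store!(A, (1.0, 2.0, 3.0, 4.0), 2)   # writes A[3:6]
store!(A, 9.0, 6, 2)                 # broadcasts 9.0 into A[7:8]
```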

These operations also take bitmasks (placed after the index tuple) to perform masked loads and stores. This is useful for handling the ends of arrays, for example.

Using a single API where types determine behavior simplifies SIMD code generation in macros or generated functions: you only need a single version of the expression-producing code, and it can handle various contingencies. The more work we can move from metaprogramming to multiple dispatch, the better.
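
A toy illustration of the idea (hypothetical types, not the SIMDPirates API): the generated code emits a single call, and dispatch on the index type decides between a scalar and a vector load.

```julia
# Stand-in for _MM{W}: a width-W vector index starting at zero-based i.
struct VecIndex{W}
    i::Int
end

load(A, i::Int) = A[i + 1]                                    # scalar, zero-based
load(A, idx::VecIndex{W}) where {W} = ntuple(j -> A[idx.i + j], W)  # W-wide load

A = collect(1.0:8.0)
load(A, 0)               # scalar load of the first element
load(A, VecIndex{4}(2))  # loads elements 3 through 6
```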

The _MM{W} type represents indexing into an AbstractArray at locations _MM{W}.i + j for j ∈ 1:W, which is reflected in arithmetic operations:

julia> _MM{8}(0) + 36
_MM{8}(36)

julia> _MM{8}(36) * 5
SVec{8,Int64}<180, 185, 190, 195, 200, 205, 210, 215>

Offsetting by 36 increments the index by 36, but multiplying by 5 must also multiply the step by 5. This allows cartesian indexing to be implemented as simply the dot product between the cartesian indices and the strides of the array. However, take care to avoid multiplying _MM instances by a runtime stride of 1 whenever possible: this converts the _MM into an SVec with equivalent behavior (loads/stores to the same elements) but the inferior performance of discontiguous memory accesses. Whenever a stride is known to equal 1 at compile time, as is commonly the case for the first stride, this should be exploited.
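
The dot-product formulation can be sketched in plain Julia (hypothetical helper, zero-based offsets as in the vload examples above):

```julia
# Zero-based linear offset of a cartesian index, as the dot product of the
# (zero-based) indices with the array's strides.
linear_offset(I::NTuple{N,Int}, strides::NTuple{N,Int}) where {N} =
    sum(I[k] * strides[k] for k in 1:N)

A = rand(100, 100)                     # column-major strides are (1, 100)
off = linear_offset((8, 2), (1, 100))  # zero-based offset of A[9, 3]
A[off + 1] == A[9, 3]                  # same element, converting to one-based
```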


Older documentation begins here.

SIMDPirates.jl is a library for SIMD intrinsics. The code was stolen from SIMD.jl, whose authors and maintainers deserve credit for most of the good work here. Aside from pirating code, SIMDPirates also provides a @pirate macro that lets you imagine you're committing type piracy:

julia> @macroexpand @pirate v1 * v2 + v3 * v4
:(SIMDPirates.vmuladd(v1, v2, SIMDPirates.vmul(v3, v4)))

The functions SIMDPirates.vmuladd and SIMDPirates.vmul have methods defined on NTuple{W,Core.VecElement{<:Union{Float64,Float32}}}. By substituting appropriately defined methods for base functions, you can thus use base types without actual piracy. In general, however, the recommended approach is to use SVec, a struct-wrapped vector with the relevant overloads.

julia> vbroadcast(Val(8), 0.0) + 2
SVec{8,Float64}<2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0>

Anyone is more than welcome to take any of the code or changes I've made to this library and submit them to SIMD.jl.

Highlights

First, generate a few random vectors (if you're interested in SIMD generation of random numbers, please see VectorizedRNG.jl):

julia> x = SVec(ntuple(Val(8)) do w Core.VecElement(rand()) end)
SVec{8,Float64}<0.5692277987210761, 0.33665348761817304, 0.03954926738748976, 0.3213190689556804, 0.8088511245418579, 0.35544805303664107, 0.3677375589109022, 0.4651001170793463>

julia> y = SVec(ntuple(Val(8)) do w Core.VecElement(rand()) end)
SVec{8,Float64}<0.4777741139597824, 0.06049602694925049, 0.3872217501123736, 0.486269129542215, 0.7425786365371663, 0.5857635301041568, 0.3686591983067562, 0.2412057239643277>

julia> z = SVec(ntuple(Val(8)) do w Core.VecElement(rand()) end)
SVec{8,Float64}<0.7913560950516021, 0.04884861331731183, 0.7385341388346818, 0.5228028153085258, 0.21908962866195014, 0.41415395968234314, 0.2655341712486954, 0.16997469510081653>

Fast flags on common operators, to allow for contractions:

julia> foo(x,y,z) = x * y - z
foo (generic function with 1 method)

julia> foo(x,y) = foo(x,y,0.0)
foo (generic function with 2 methods)

This results in the following asm:

julia> @code_native debuginfo=:none foo(x,y,z)
	.text
	vmovupd	(%rsi), %zmm0
	vmovupd	(%rdx), %zmm1
	vfmsub213pd	(%rcx), %zmm0, %zmm1
	vmovapd	%zmm1, (%rdi)
	movq	%rdi, %rax
	vzeroupper
	retq
	nop

julia> @code_native debuginfo=:none foo(x,y)
	.text
	vmovupd	(%rsi), %zmm0
	vmulpd	(%rdx), %zmm0, %zmm0
	vmovapd	%zmm0, (%rdi)
	movq	%rdi, %rax
	vzeroupper
	retq
	nopl	(%rax)

The arithmetic of the three-argument foo was reduced to a single vfmsub213pd instruction (vectorized fused multiply-subtract of packed double-precision numbers), while the two-argument version dropped the 0.0, producing a vmulpd (vectorized multiplication of packed doubles).
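
Why contraction matters numerically can be seen in plain Julia without any SIMD: a fused multiply-subtract rounds once, while the unfused form rounds twice, so the results can differ in the last bit.

```julia
# Pick values where x*y is not exactly representable: (1+e)(1-e) = 1 - e^2.
x, y, z = 1.0 + 2.0^-27, 1.0 - 2.0^-27, 1.0

fused   = fma(x, y, -z)   # exact product, single rounding at the end
unfused = x * y - z       # x*y rounds to 1.0 first, so the result is 0.0
```

Here `fused` recovers the tiny residual -2^-54 that `unfused` loses entirely, which is exactly the distinction the evmul/evsub functions below exist to preserve.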

For implementing compensated algorithms, SIMDPirates provides functions that prevent these optimizations:

julia> efoo(x,y,z) = SIMDPirates.evsub(SIMDPirates.evmul(x, y), z)
efoo (generic function with 1 method)

julia> efoo(x,y) = efoo(x, y, x - x)
efoo (generic function with 2 methods)


julia> @code_native debuginfo=:none efoo(x,y,z)
	.text
	vmovupd	(%rsi), %zmm0
	vmulpd	(%rdx), %zmm0, %zmm0
	vsubpd	(%rcx), %zmm0, %zmm0
	vmovapd	%zmm0, (%rdi)
	movq	%rdi, %rax
	vzeroupper
	retq
	nop

julia> @code_native debuginfo=:none efoo(x,y)
	.text
	vmovupd	(%rsi), %zmm0
	vmulpd	(%rdx), %zmm0, %zmm0
	vmovapd	%zmm0, (%rdi)
	movq	%rdi, %rax
	vzeroupper
	retq
	nopl	(%rax)

Although the (x - x) subtraction is still elided, the multiplication and subtraction were not contracted.

Most of the more interesting additions and changes are related to memory management. The preferred means of masking is to use bitmasks instead of NTuple{W,Core.VecElement{Bool}}. This is in large part because working with bitmasks is extremely efficient on AVX-512 architectures. Note that each Bool is a byte, an i8 rather than an i1; the zero extensions and truncations needed to convert back and forth between them would almost certainly be elided by the compiler. The advantage of the bit representation is that masks can conveniently be generated by means other than a vectorized comparison. For example, if you have a loop of length N = 141 and are using vectors of width 8:

julia> 141 & 7
5

julia> one(UInt8) << ans - one(UInt8)
0x1f

julia> bitstring(ans)
"00011111"

julia> x = rand(5);

julia> vload(Val(8), pointer(x), 0x1f)
SVec{8,Float64}<0.9883925090112797, 0.5963333776815305, 0.39507716254066527, 0.20452877630045485, 0.11416439490499686, 0.0, 0.0, 0.0>

This can be an efficient means of safely handling the remaining iterations of a loop.
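
The mask computation above generalizes to any trip count (a sketch, assuming the vector width W is a power of two and W ≤ 8 so the mask fits in a UInt8):

```julia
# Low (N % W) bits set: one mask bit per remainder element.
# Note the mask is zero when N is an exact multiple of W, i.e. when no
# remainder iteration is needed at all.
remmask(N, W) = (one(UInt8) << (N & (W - 1))) - one(UInt8)

remmask(141, 8)   # 141 % 8 == 5 remaining elements -> 0x1f
remmask(16, 8)    # no remainder -> 0x00
```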
