Compute.scala

Compute.scala is a Scala library for scientific computing with N-dimensional arrays in parallel on GPU, CPU and other devices. It will be the primary back-end of the upcoming DeepLearning.scala 3.0, to address the performance problems we encountered in DeepLearning.scala 2.0 with ND4J.

Compute.scala can dynamically merge multiple operators into one kernel program, which runs significantly faster when performing complex computations.
Compute.scala manages data buffers and other native resources deterministically, consuming less memory and reducing the performance impact of garbage collection.
All dimensional transformation operators (permute, broadcast, translate, etc.) in Compute.scala are views, with no additional data buffer allocation.
N-dimensional arrays in Compute.scala can be split into JVM collections, which support higher-order functions like map / reduce and can still run on GPU.

Getting started

System Requirements

Compute.scala is based on LWJGL 3's OpenCL binding, which supports GPUs and CPUs from AMD, NVIDIA and Intel on Linux, Windows and macOS.

Make sure you have met the following system requirements before using Compute.scala.

Linux, Windows or macOS
JDK 8
OpenCL runtime

The performance of Compute.scala varies with OpenCL runtimes. For best performance, install the OpenCL runtime according to the following table.

| | Linux | Windows | macOS |
| --- | --- | --- | --- |
| NVIDIA GPU | NVIDIA GPU Driver | NVIDIA GPU Driver | macOS's built-in OpenCL SDK |
| AMD GPU | AMDGPU-PRO Driver | AMD OpenCL™ 2.0 Driver | macOS's built-in OpenCL SDK |
| Intel or AMD CPU | POCL | POCL | POCL |

In particular, Compute.scala generates non-vectorized code, so it relies on POCL's auto-vectorization feature for best performance when running on CPU.

Project setup

The artifacts of Compute.scala are published on the Maven Central repository for Scala 2.11 and 2.12. Add the following settings to your build.sbt if you are using sbt.

// For N-dimensional array on CPU
libraryDependencies += "com.thoughtworks.compute" %% "cpu" % "latest.release"

// For N-dimensional array on GPU
libraryDependencies += "com.thoughtworks.compute" %% "gpu" % "latest.release"

// LWJGL OpenCL library
libraryDependencies += "org.lwjgl" % "lwjgl-opencl" % "latest.release"

// Platform dependent runtime of LWJGL core library
libraryDependencies += ("org.lwjgl" % "lwjgl" % "latest.release").jar().classifier {
  import scala.util.Properties._
  if (isMac) {
    "natives-macos"
  } else if (isLinux) {
    "natives-linux"
  } else if (isWin) {
    "natives-windows"
  } else {
    throw new MessageOnlyException(s"lwjgl does not support $osName")
  }
}

For Maven, Gradle and other build tools, check the Compute.scala page on Scaladex and the LWJGL customize tool for the corresponding settings.

Creating an N-dimensional array

Import the types from the gpu or cpu object, depending on the OpenCL runtime you want to use.

// For N-dimensional array on GPU
import com.thoughtworks.compute.gpu._

// For N-dimensional array on CPU
import com.thoughtworks.compute.cpu._

In Compute.scala, an N-dimensional array is typed as Tensor, which can be created from Seq or Array.

val my2DArray: Tensor = Tensor(Array(Seq(1.0f, 2.0f, 3.0f), Seq(4.0f, 5.0f, 6.0f)))

If you print out my2DArray,

println(my2DArray)

then the output should be

[[1.0,2.0,3.0],[4.0,5.0,6.0]]

You can also print the size of each dimension using the shape method.

// Output 2 because my2DArray is a 2D array.
println(my2DArray.shape.length)

// Output 2 because the size of first dimension of my2DArray is 2.
println(my2DArray.shape(0)) // 2

// Output 3 because the size of second dimension of my2DArray is 3.
println(my2DArray.shape(1)) // 3

So my2DArray is a 2D array of size 2×3.

Scalar value

Note that a Tensor can be a zero-dimensional array, which is simply a scalar value.

val scalar = Tensor(42.0f)
println(scalar.shape.length) // 0

Element-wise operators

Element-wise operators are applied to each element of their Tensor operands.

val plus100 = my2DArray + Tensor.fill(100.0f, Array(2, 3))

println(plus100) // Output [[101.0,102.0,103.0],[104.0,105.0,106.0]]
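The semantics of an element-wise operator can be sketched in plain Scala without any device. The block below is only an illustration on nested Seqs; the names ElementWiseSketch, Matrix, fill and plus are hypothetical and are not part of Compute.scala.

```scala
object ElementWiseSketch {
  // Hypothetical stand-in for a 2D tensor: a Seq of rows.
  type Matrix = Seq[Seq[Float]]

  // Builds a rows x columns matrix in which every element is `value`.
  def fill(value: Float, rows: Int, columns: Int): Matrix =
    Seq.fill(rows, columns)(value)

  // Adds two matrices of the same shape, element by element.
  def plus(left: Matrix, right: Matrix): Matrix =
    left.zip(right).map { case (leftRow, rightRow) =>
      leftRow.zip(rightRow).map { case (x, y) => x + y }
    }
}
```

Calling ElementWiseSketch.plus on the 2×3 matrix from the example above and ElementWiseSketch.fill(100.0f, 2, 3) pairs up the elements at the same position, which is the behaviour the + operator on Tensors is described as having.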

Design

Lazy-evaluation

Tensors in Compute.scala are immutable and lazily evaluated. All operators that create Tensors are pure: they allocate no data buffers and do not execute any time-consuming tasks. The actual computation is only performed when the final result is requested.

For example:

val a = Tensor(Seq(Seq(1.0f, 2.0f, 3.0f), Seq(4.0f, 5.0f, 6.0f)))
val b = Tensor(Seq(Seq(7.0f, 8.0f, 9.0f), Seq(10.0f, 11.0f, 12.0f)))
val c = Tensor(Seq(Seq(13.0f, 14.0f, 15.0f), Seq(16.0f, 17.0f, 18.0f)))

val result: InlineTensor = a * b + c

All the Tensors, including a, b, c and result, are small JVM objects, and no computation has been performed so far.

println(result.toString)

When result.toString is called, Compute.scala compiles the expression a * b + c into one kernel program and executes it.

Both result and the temporary variable a * b are InlineTensors, indicating that their computation can be inlined into a more complex kernel program. You can think of an InlineTensor as an @inline def method on the device side.

This approach can be faster than libraries that run one kernel per operator, because we don't have to execute two separate kernels for the multiplication and the addition.

Check the Scaladoc to see which operators return InlineTensor or its subtype TransformedTensor, which can be inlined into a more complex kernel program as well.
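To make the idea of merging operators concrete, here is a minimal plain-Scala sketch. It is not Compute.scala's implementation and generates no OpenCL code; the names FusionSketch, Expr, Leaf, Mul, Add and evaluate are hypothetical. It only shows how building an expression can be free of arithmetic, and how a single pass can then evaluate a * b + c without an intermediate array for a * b.

```scala
object FusionSketch {
  // Hypothetical expression tree: building it performs no arithmetic.
  sealed trait Expr {
    def *(that: Expr): Expr = Mul(this, that)
    def +(that: Expr): Expr = Add(this, that)
  }
  final case class Leaf(data: Array[Float]) extends Expr
  final case class Mul(left: Expr, right: Expr) extends Expr
  final case class Add(left: Expr, right: Expr) extends Expr

  // Value of the whole expression at one index.
  // This stands in for the body of a single fused kernel.
  private def elementAt(expr: Expr, i: Int): Float = expr match {
    case Leaf(data)       => data(i)
    case Mul(left, right) => elementAt(left, i) * elementAt(right, i)
    case Add(left, right) => elementAt(left, i) + elementAt(right, i)
  }

  // One pass over the indices; no array is allocated for sub-expressions.
  def evaluate(expr: Expr, length: Int): Array[Float] =
    Array.tabulate(length)(i => elementAt(expr, i))

  val a = Leaf(Array(1.0f, 2.0f, 3.0f))
  val b = Leaf(Array(7.0f, 8.0f, 9.0f))
  val c = Leaf(Array(13.0f, 14.0f, 15.0f))

  // Still only a small object graph; nothing has been computed yet.
  val result: Expr = a * b + c
}
```

In this sketch, FusionSketch.evaluate(FusionSketch.result, 3) plays the role that result.toString plays in the text: it is the point where the computation actually happens.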

Caching

By default, when result.toString is called more than once, the expression a * b + c is executed more than once.

println(result.toString)

// The computation is performed again
println(result.toString)

Fortunately, we provide a doCache method to eagerly allocate a data buffer for a CachedTensor.

import com.thoughtworks.future._
import com.thoughtworks.raii.asynchronous._

val Resource(cachedTensor, releaseCache) = result.doCache.acquire.blockingAwait

try {
  // The cache is reused. No device-side computation is performed.
  println(cachedTensor.toString)

  // The cache is reused. No device-side computation is performed.
  println(cachedTensor.toString)

  val tmp: InlineTensor = exp(cachedTensor)
  
  // The cache for cachedTensor is reused, but the exponential function is performed.
  println(tmp.toString)

  // The cache for cachedTensor is reused, but the exponential function is performed, again.
  println(tmp.toString)
} finally {
  releaseCache.blockingAwait
}

// Crash because the data buffer has been released
println(cachedTensor.toString)

The data buffer allocated for cachedTensor is kept until releaseCache is performed.

You can think of a CachedTensor as a lazy val on the device side.
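The def versus lazy val analogy can be illustrated with ordinary JVM values. The sketch below uses no Compute.scala types; DefVersusLazyVal, expensive, inlined and cached are hypothetical names, and the counter exists only to make the number of evaluations observable.

```scala
object DefVersusLazyVal {
  // Counts how often the expensive expression is evaluated.
  var evaluations = 0

  private def expensive(): Seq[Float] = {
    evaluations += 1
    Seq(1.0f, 2.0f, 3.0f).map(_ * 2.0f)
  }

  // Like an InlineTensor: evaluated again on every use.
  def inlined: Seq[Float] = expensive()

  // Like a CachedTensor: evaluated once, then the stored result is reused.
  lazy val cached: Seq[Float] = expensive()
}
```

The analogy has a limit: a lazy val is never released explicitly, whereas the data buffer behind a CachedTensor is kept only until its release action is performed.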

By combining pure Tensors along with the impure doCache mechanism, we achieved the following goals:

All Tensors are pure. No data buffer is allocated when creating them.
The computation of Tensors can be merged together, to minimize the number of intermediate data buffers and kernel programs.
Developers can create caches for Tensors, as a deterministic way to manage the life cycle of resources.

Mutable variables

Tensors are immutable, but you can create mutable variables that hold cached tensors to work around this limitation.

var Resource(weight, releaseWeight) = Tensor.random(Array(32, 32)).doCache.acquire.blockingAwait

while (true) {
  val Resource(newWeight, releaseNewWeight) = (weight * Tensor.random(Array(32, 32))).doCache.acquire.blockingAwait
  
  releaseWeight.blockingAwait
  
  weight = newWeight
  releaseWeight = releaseNewWeight
}

Use this approach with caution. doCache should only be used for permanent data (e.g. the weights of a neural network). doCache is not designed for intermediate variables in a complex expression. A sophisticated Scala developer should be able to entirely avoid var and while in favor of recursive functions.
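As a rough sketch of the recursive style mentioned above, the following plain-Scala block replaces var and while with a tail-recursive function. It does not use Compute.scala: Cached is a hypothetical stand-in for a cached tensor together with its release action, liveBuffers is a counter added only for illustration, and the loop is bounded by stepsLeft (an assumption; the loop in the original example never terminates).

```scala
import scala.annotation.tailrec

object RecursiveUpdateSketch {
  // Number of hypothetical buffers that are currently allocated.
  var liveBuffers = 0

  // Hypothetical stand-in for a cached tensor plus its release action.
  final class Cached(val values: Vector[Float]) {
    liveBuffers += 1
    def release(): Unit = liveBuffers -= 1
  }

  // Each step allocates the new cache, releases the old one,
  // and passes the new cache on to the next step.
  @tailrec
  def train(weight: Cached, stepsLeft: Int): Cached =
    if (stepsLeft <= 0) weight
    else {
      val newWeight = new Cached(weight.values.map(_ * 0.5f))
      weight.release()
      train(newWeight, stepsLeft - 1)
    }
}
```

As in the while loop, at most two caches are alive at any moment, and exactly one remains after train returns; the caller is responsible for releasing it.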

Scala collection interoperability

`split`

A Tensor can be split into smaller Tensors along a specific dimension.

For example, given a 3D tensor whose shape is 2×3×4,

val my3DTensor = Tensor((0.0f until 24.0f by 1.0f).grouped(4).toSeq.grouped(3).toSeq)

val Array(2, 3, 4) = my3DTensor.shape

when it is split along dimension #0,

val subtensors0: Seq[Tensor] = my3DTensor.split(dimension = 0)

then the result should be a Seq of two 3×4 tensors.

// Output: TensorSeq([[0.0,1.0,2.0,3.0],[4.0,5.0,6.0,7.0],[8.0,9.0,10.0,11.0]], [[
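One way a split along a dimension could work without copying data is a strided view over a shared flat buffer. The sketch below is plain Scala and is not a claim about how Compute.scala implements split; SplitViewSketch, View, strides and offset are hypothetical names, and a row-major layout is assumed.

```scala
object SplitViewSketch {
  // Removes the element at position `i` from a vector.
  private def dropAt(vector: Vector[Int], i: Int): Vector[Int] =
    vector.take(i) ++ vector.drop(i + 1)

  // Hypothetical strided view over a shared flat buffer (row-major layout).
  final case class View(buffer: Array[Float], shape: Vector[Int], strides: Vector[Int], offset: Int) {
    // Reads one element by multiplying each index with the stride of its dimension.
    def apply(indices: Int*): Float =
      buffer(offset + indices.zip(strides).map { case (i, s) => i * s }.sum)

    // Splitting along `dimension` drops that dimension and shifts the offset.
    // Every resulting view still points at the same buffer.
    def split(dimension: Int): Seq[View] =
      (0 until shape(dimension)).map { i =>
        View(buffer, dropAt(shape, dimension), dropAt(strides, dimension), offset + i * strides(dimension))
      }
  }

  // The values 0.0 to 23.0, viewed as a 2×3×4 array.
  val buffer: Array[Float] = Array.tabulate(24)(_.toFloat)
  val my3DView = View(buffer, Vector(2, 3, 4), Vector(12, 4, 1), 0)
}
```

Here SplitViewSketch.my3DView.split(0) yields two views of shape 3×4, mirroring the split(dimension = 0) call above, and splitting along dimension 1 instead yields three views of shape 2×4.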
