Mandoline

A distributed, versioned, multi-dimensional array database

mandoline-core

Distributed, versioned, n-dimensional array database.

What is Mandoline?

Mandoline is a Clojure library for reading and writing immutable, versioned datasets that contain multidimensional arrays. The Mandoline library can be extended to use different data store implementations. Currently supported data store implementations are:

Usage

If your project uses Leiningen, add the io.mandoline/mandoline-core dependency to the :dependencies section of your project.clj. The current coordinates are published on the project's Clojars page.

Please note that this will only give you the in-memory store. For a persistent store, use one of the options listed above.

Tutorial

Overview

This tutorial will walk you through:

  • The concept of metadata in Mandoline
  • The concept of a data slab in Mandoline
  • Creating a new dataset
  • Writing data to a new dataset
  • Reading data from a dataset
  • Writing data to a new version of a dataset
  • Reading data from multiple versions of a dataset
  • Deleting a dataset

For this tutorial, you need to start a Clojure REPL in the io.mandoline/mandoline-core project or in a project that includes io.mandoline/mandoline-core as a dependency.

Start the REPL and require/import the following:

    user=> (require '[io.mandoline :as mandoline])
    nil
    user=> (require '[io.mandoline.dataset :as dataset])
    nil
    user=> (require '[io.mandoline.slab :as slab])
    nil
    user=> (require '[io.mandoline.slice :as slice])
    nil
    user=> (require '[io.mandoline.impl :as impl])
    nil
    user=> (import '[ucar.ma2 Array])
    ucar.ma2.Array

A Mandoline dataset loosely resembles a NetCDF or Common Data Model dataset. A dataset contains zero or more variables (arrays) that are defined on named dimensions (array axes). Each variable is a (possibly multi-dimensional) array of homogeneous type that is defined on zero or more dimensions. Multiple variables can share dimensions.

To create a Mandoline dataset, you need to provide:

  1. a metadata map that defines the structure of the dataset, and
  2. slabs that contain array values to populate the variables

These ingredients will be described in the next two parts of this tutorial.

Metadata

As an example, define the following metadata map in the REPL (adapted from a real-world NetCDF dataset):

    (def metadata
      {:dimensions
       {:longitude 144, :latitude 73, :time 62}
       :chunk-dimensions
       {:longitude 20, :latitude 20, :time 40}
       :variables
       {:longitude
        {:type "float"
         :fill-value Float/NaN
         :shape ["longitude"]}
        :latitude
        {:type "float"
         :fill-value Float/NaN
         :shape ["latitude"]}
        :time
        {:type "int"
         :fill-value Integer/MIN_VALUE
         :shape ["time"]}
        :tcw
        {:type "short"
         :fill-value Short/MIN_VALUE
         :shape ["time" "latitude" "longitude"]}}})

This metadata map describes a dataset that has this structure:

  • Dimensions
      - longitude: length is 144, and storage chunk size is 20
      - latitude: length is 73, and storage chunk size is 20
      - time: length is 62, and storage chunk size is 40
  • Variables
      - longitude: 1-dimensional array of type float with shape [144], defined on the longitude dimension
      - latitude: 1-dimensional array of type float with shape [73], defined on the latitude dimension
      - time: 1-dimensional array of type int with shape [62], defined on the time dimension
      - tcw: 3-dimensional array of type short with shape [62 73 144], defined on the dimensions [time latitude longitude]

You may wonder about the significance of the :chunk-dimensions entry in the metadata map. It can be regarded either as a leaky implementation detail or as a hint to the underlying data store. Each variable is partitioned into non-overlapping tiles ("chunks") whose maximum extent along each dimension is specified by :chunk-dimensions.
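
As a rough illustration of what this tiling implies, the following sketch counts the chunks for the tcw variable of the example metadata. It assumes that chunks lie on a regular grid starting at index 0, which this tutorial does not state explicitly. The helpers chunks-along and chunk-grid are hypothetical and are not part of the Mandoline API.

```clojure
;; Dimension lengths and chunk sizes copied from the example metadata map.
(def dimensions {:longitude 144, :latitude 73, :time 62})
(def chunk-dimensions {:longitude 20, :latitude 20, :time 40})

(defn chunks-along
  "Hypothetical helper: number of tiles of at most `chunk` elements
  needed to cover an axis of `length` elements (ceiling division)."
  [length chunk]
  (quot (+ length chunk -1) chunk))

(defn chunk-grid
  "Hypothetical helper: chunk count along each axis of a variable,
  given its :shape as a vector of dimension names."
  [shape]
  (mapv (fn [dim]
          (chunks-along (get dimensions (keyword dim))
                        (get chunk-dimensions (keyword dim))))
        shape))

(chunk-grid ["time" "latitude" "longitude"])
;; => [2 4 8]
(reduce * (chunk-grid ["time" "latitude" "longitude"]))
;; => 64
```

Under that assumption, the last chunk along each axis is partial. For example, the fourth latitude chunk holds only 13 of its 20 possible rows.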

You may also wonder about the significance of the :fill-value entry that is associated with each variable in the metadata map. Mandoline requires a default element value for each variable so that it can optimize storage. This default value is specified by :fill-value and is mandatory.

You can use the function io.mandoline.dataset/validate-dataset-definition to check whether a metadata map is well-formed. This function throws an exception on invalid metadata and otherwise returns nil.

    user=> (dataset/validate-dataset-definition metadata)
    nil

Slab

Now you have a metadata map that describes the structure of the dataset. You also need data to populate the dataset. The Mandoline library enables you to write data to a contiguous section of a single variable, which is called a "slab". The namespace io.mandoline.slab defines a Slab record type. A Slab record has two fields:

  1. The (possibly multi-dimensional) array data to be written, which must be a ucar.ma2.Array instance. The data type of this array data must match the data type of the destination variable.
  2. The ranges of array indices that specify where in the destination variable the array data is to be written, which must be an instance of the io.mandoline.slice/Slice record type. You can use the convenience function io.mandoline.slice/mk-slice to create a Slice instance. The slice must be compatible with the shape of the destination variable.

As an example, create a 1-dimensional slab with shape [10] that corresponds to the index range from 0 (inclusive) to 10 (exclusive) of a variable:

    user=> (let [array (Array/factory Float/TYPE (int-array [10]))
      #_=>       slice (slice/mk-slice [0] [10])]
      #_=>   (slab/->Slab array slice))
    #io.mandoline.slab.Slab{:data #<D1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 >, :slice #io.mandoline.slice.Slice{:start [0], :stop [10], :step [1]}}

Note that this slab is defined independently of any variable in a Mandoline dataset. It contains array values that can potentially be assigned to a subsection of a variable, but it does not inherently represent an assignment operation. If you were to attempt to write this slab to a specific variable in a dataset (as you will do later in this tutorial), Mandoline would fail the attempted write if any of the following conditions were not satisfied:

  • The destination variable has data type "float" to match the data type of the slab (Float/TYPE).
  • The destination variable is 1-dimensional, to match the 1-dimensional data in the slab.
  • The destination variable has an extent that is long enough so that indices from 0 (inclusive) to 10 (exclusive) along its 0th dimension are valid indices.
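
The three conditions can be summarized in a small predicate. This is only a sketch of the checks described above, written against plain maps rather than real Slab records. The function slab-compatible? and the :slab-type, :start, and :stop keys are hypothetical, and Mandoline's own validation may differ.

```clojure
(defn slab-compatible?
  "Hypothetical check, not part of Mandoline: true when a slab described by
  its element type and index ranges could be written to a variable described
  by its metadata entry (:type) and its resolved shape (:shape, as lengths)."
  [{:keys [slab-type start stop]} {:keys [type shape]}]
  (and
    ;; 1. Element types must match.
    (= slab-type type)
    ;; 2. The slab and the variable must have the same number of dimensions.
    (= (count start) (count stop) (count shape))
    ;; 3. Every index range must lie within the variable's extent.
    (every? true?
            (map (fn [lo hi extent] (and (<= 0 lo) (< lo hi) (<= hi extent)))
                 start stop shape))))

;; The float slab with slice [0, 10) against the 144-element longitude variable.
(slab-compatible? {:slab-type "float" :start [0] :stop [10]}
                  {:type "float" :shape [144]})
;; => true

;; The same slab against the 3-dimensional tcw variable of type short.
(slab-compatible? {:slab-type "float" :start [0] :stop [10]}
                  {:type "short" :shape [62 73 144]})
;; => false
```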

To populate a large variable, you will need to perform distributed writes with multiple slabs, where each slab fits in the memory of a single process but the collection of all slabs is prohibitively large. The Slab write interface of Mandoline is designed to support this use case.

To successfully write to a variable, a slab does not need to coincide with the chunks that are defined by :chunk-dimensions in the dataset's metadata map. Mandoline automatically partitions a slab into (possibly partial) chunks for storage.
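
To make the idea of partial chunks concrete, here is a minimal sketch of how one index range could be split along chunk boundaries on a single axis. It assumes that chunk boundaries fall at multiples of the chunk size; split-range is a hypothetical helper and is not how Mandoline is necessarily implemented.

```clojure
(defn split-range
  "Hypothetical helper: split the half-open index range [start, stop) into
  pieces that each fall inside one chunk of size `chunk`.
  Assumes 0 <= start < stop and chunk boundaries at multiples of `chunk`."
  [start stop chunk]
  (for [i (range (quot start chunk) (inc (quot (dec stop) chunk)))]
    {:chunk i
     :start (max start (* i chunk))
     :stop  (min stop (* (inc i) chunk))}))

;; A slab covering indices [15, 45) on an axis with chunk size 20
;; touches three chunks; the first and last are only partially covered.
(split-range 15 45 20)
;; => ({:chunk 0, :start 15, :stop 20}
;;     {:chunk 1, :start 20, :stop 40}
;;     {:chunk 2, :start 40, :stop 45})
```

For a multi-dimensional slab, the same split would apply independently along each axis, and the touched chunks would be the Cartesian product of the per-axis pieces.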

Mandoline uses the Slab record type for reading data as well as for writing. The function io.mandoline/get-slice (which you will use later in this tutorial) returns a Slab instance.

To continue this tutorial, define the following slabs to write to your sample dataset:

    (def slabs
      {:longitude
       [(slab/->Slab
          (Array/factory
            Float/TYPE
            (int-array [144])
            (float-array (range 0 360 2.5)))
          (slice/mk-slice [0] [144]))]
       :latitude
       [(slab/->Slab
          (Array/factory
            Float/TYPE
            (int-array [73])
            (float-array (range 90 -92.5 -2.5)))
          (slice/mk-slice [0] [73]))]
       :time
       [(slab/->Slab
          (Array/factory
            Integer/TYPE
            (int-array [62])
            (int-array (range 898476 899214 12)))
          (slice/mk-slice [0] [62]))]
       :tcw
       [(slab/->Slab
          (Array/factory
            Short/TYPE
            (int-array [62 73 144])
            (short-array
              (repeatedly
                (* 62 73 144)
                #(short (rand-int Short/MAX_VALUE)))))
          (slice/mk-slice [0 0 0] [62 73 144]))]})

This slabs var is a map whose keys are variable keywords and whose values are vectors of data slabs. Because the example dataset is small, you can populate each variable with one slab that covers the entire extent of the variable.
