Column
High-performance, columnar, in-memory store with bitmap indexing in Go
Install / Use
/learn @kelindar/ColumnREADME
Columnar In-Memory Store with Bitmap Indexing
This package contains a high-performance, columnar, in-memory storage engine that supports fast querying, update and iteration with zero-allocations and bitmap indexing.
Features
- Optimized, cache-friendly columnar data layout that minimizes cache-misses.
- Optimized for zero heap allocation during querying (see benchmarks below).
- Optimized batch updates/deletes, an update during a transaction takes around
12ns. - Support for SIMD-enabled aggregate functions such as "sum", "avg", "min" and "max".
- Support for SIMD-enabled filtering (i.e. "where" clause) by leveraging bitmap indexing.
- Support for columnar projection (i.e. "select" clause) for fast retrieval.
- Support for computed indexes that are dynamically calculated based on provided predicate.
- Support for concurrent updates using sharded latches to keep things fast.
- Support for transaction isolation, allowing you to create transactions and commit/rollback.
- Support for expiration of rows based on time-to-live or expiration column.
- Support for atomic merging of any values, transactionally.
- Support for primary keys for use-cases where offset can't be used.
- Support for change data stream that streams all commits consistently.
- Support for concurrent snapshotting allowing to store the entire collection into a file.
Documentation
The general idea is to leverage cache-friendly ways of organizing data in structures of arrays (SoA) otherwise known "columnar" storage in database design. This, in turn allows us to iterate and filter over columns very efficiently. On top of that, this package also adds bitmap indexing to the columnar storage, allowing to build filter queries using binary and, and not, or and xor (see kelindar/bitmap with SIMD support).
- Collection and Columns
- Querying and Indexing
- Iterating over Results
- Updating Values
- Expiring Values
- Transaction Commit and Rollback
- Using Primary Keys
- Storing Binary Records
- Streaming Changes
- Snapshot and Restore
- Examples
- Benchmarks
- Contributing
Collection and Columns
In order to get data into the store, you'll need to first create a Collection by calling NewCollection() method. Each collection requires a schema, which needs to be specified by calling CreateColumn() multiple times or automatically inferred from an object by calling CreateColumnsOf() function. In the example below we create a new collection with several columns.
// Create a new collection with some columns
players := column.NewCollection()
players.CreateColumn("name", column.ForString())
players.CreateColumn("class", column.ForString())
players.CreateColumn("balance", column.ForFloat64())
players.CreateColumn("age", column.ForInt16())
Now that we have created a collection, we can insert a single record by using Insert() method on the collection. In this example we're inserting a single row and manually specifying values. Note that this function returns an index that indicates the row index for the inserted row.
index, err := players.Insert(func(r column.Row) error {
r.SetString("name", "merlin")
r.SetString("class", "mage")
r.SetFloat64("balance", 99.95)
r.SetInt16("age", 107)
return nil
})
While the previous example demonstrated how to insert a single row, inserting multiple rows this way is rather inefficient. This is due to the fact that each Insert() call directly on the collection initiates a separate transacion and there's a small performance cost associated with it. If you want to do a bulk insert and insert many values, faster, that can be done by calling Insert() on a transaction, as demonstrated in the example below. Note that the only difference is instantiating a transaction by calling the Query() method and calling the txn.Insert() method on the transaction instead the one on the collection.
players.Query(func(txn *column.Txn) error {
for _, v := range myRawData {
txn.Insert(...)
}
return nil // Commit
})
Querying and Indexing
The store allows you to query the data based on a presence of certain attributes or their values. In the example below we are querying our collection and applying a filtering operation bu using WithValue() method on the transaction. This method scans the values and checks whether a certain predicate evaluates to true. In this case, we're scanning through all of the players and looking up their class, if their class is equal to "rogue", we'll take it. At the end, we're calling Count() method that simply counts the result set.
// This query performs a full scan of "class" column
players.Query(func(txn *column.Txn) error {
count := txn.WithValue("class", func(v interface{}) bool {
return v == "rogue"
}).Count()
return nil
})
Now, what if we'll need to do this query very often? It is possible to simply create an index with the same predicate and have this computation being applied every time (a) an object is inserted into the collection and (b) an value of the dependent column is updated. Let's look at the example below, we're fist creating a rogue index which depends on "class" column. This index applies the same predicate which only returns true if a class is "rogue". We then can query this by simply calling With() method and providing the index name.
An index is essentially akin to a boolean column, so you could technically also select it's value when querying it. Now, in this example the query would be around 10-100x faster to execute as behind the scenes it uses bitmap indexing for the "rogue" index and performs a simple logical AND operation on two bitmaps when querying. This avoid the entire scanning and applying of a predicate during the Query.
// Create the index "rogue" in advance
out.CreateIndex("rogue", "class", func(v interface{}) bool {
return v == "rogue"
})
// This returns the same result as the query before, but much faster
players.Query(func(txn *column.Txn) error {
count := txn.With("rogue").Count()
return nil
})
The query can be further expanded as it allows indexed intersection, difference and union operations. This allows you to ask more complex questions of a collection. In the examples below let's assume we have a bunch of indexes on the class column and we want to ask different questions.
First, let's try to merge two queries by applying a Union() operation with the method named the same. Here, we first select only rogues but then merge them together with mages, resulting in selection containing both rogues and mages.
// How many rogues and mages?
players.Query(func(txn *column.Txn) error {
txn.With("rogue").Union("mage").Count()
return nil
})
Next, let's count everyone who isn't a rogue, for that we can use a Without() method which performs a difference (i.e. binary AND NOT operation) on the collection. This will result in a count of all players in the collection except the rogues.
// How many rogues and mages?
players.Query(func(txn *column.Txn) error {
txn.Without("rogue").Count()
return nil
})
Now, you can combine all of the methods and keep building more complex queries. When querying indexed and non-indexed fields together it is important to know that as every scan will apply to only the selection, speeding up the query. So if you have a filter on a specific index that selects 50% of players and then you perform a scan on that (e.g. WithValue()), it will only scan 50% of users and hence will be 2x faster.
// How many rogues that are over 30 years old?
players.Query(func(txn *column.Txn) error {
txn.With("rogue").WithFloat("age", func(v float64) bool {
return v >= 30
}).Count()
return nil
})
Iterating over Results
In all of the previous examples, we've only been doing Count() operation which counts the number of elements in the result set. In this section we'll look how we can iterate over the result set.
As before, a transaction needs to be started using the Query() method on the collection. After which, we can call the txn.Range() method which allows us to iterate over the result set in the transaction. Note that it can be chained right after With..() methods, as expected.
In order to access the results of the iteration, prior to calling Range() method, we need to first load column reader(s) we are going to need, using methods such as txn.String(), txn.Float64(), etc. These prepare read/w
