Fs2
B+ Tree - List - File Structures 2 - Memory Mapped File Structures for Go
Install / Use
/learn @timtadh/Fs2README
fs2 - File Structures 2
by Tim Henderson (tadh@case.edu)
Licensed under the GNU GPL version 3 or at your option any later version. If you need another licensing option please contact me directly.
What is this?
- A B+ Tree implementation
- A list implementation supporting O(1) Append, Pop, Get and Set operations.
- A command to generate type specific wrappers around the above structures. It's generic, in Go, kinda.
- A platform for implementing memory mapped high performance file structures in Go.
Why did you make this?
In my academic research some of the algorithms I work on (such as frequent subgraph mining) have exponential characteristics in their memory usage. In order to run these algorithms on larger data sets they need to be able to transparently cache less popular parts of the data to disk. However, in general it is best to keep as much data in memory as possible.
Of course, there are many options for such data stores. I could have used a off the shelf database, however I also want to explore ways to achieve higher performance than those solutions offer.
Have you worked on this type of system before?
I have! In the golang world I believe I was the first to implement a disk backed B+ Tree. Here is an early commit from my file-structures repository. Note the date: February 21, 2010. Go was made public in November of 2009 (the first weekly was November 06, 2009). I started work on the B+ Tree in January of 2010.
This particular experiment is a follow up to my work in the file-structures repository. I have used those structures successfully many times but I want to experiment with new ways of doing things to achieve better results. Besides my disk backed versions of these structures you can also find good implementations of in memory version in my data-structures repository.
Limitations
Currently, fs2 is only available for Linux. I am looking for assistance making a darwin (Mac OS) port and a Windows port. Please get in touch if you are interested and have a Mac or Windows PC.
B+ Tree
This is a disk backed low level "key-value store". The closest thing similar to what it offers is Bolt DB. My blog post is a great place to start to learn more about the ubiquitous B+ Tree.
Features
-
Variable length size key or fixed sized keys. Fixed sized keys should be kept relatively short, less than 1024 bytes (the shorter the better). Variable length keys can be up to 2^31 - 1 bytes long.
-
Variable length values or fixed sized values. Fixed sized values should also be kept short, less than 1024 bytes. Variable length values can be up to 2^31 - 1 bytes long.
-
Duplicate key support. Duplicates are kept out of the index and only occur in the leaves.
-
Data is only written to disk when you tell it (or when need due to OS level page management).
-
Simple (but low level) interface.
-
Can operate in either a anonymous memory map or in a file backed memory map. If you plan to have a very large tree (even one that never needs to be persisted) it is recommend you use a file backed memory map. The OS treats pages in the file cache different than pages which are not backed by files.
-
The command
fs2-genericcan generate a wrapper specialized to your data type. Typing saved! To usego install github.com/timtadh/fs2/fs2-generic. Get help withfs2-generic --help
Limitations
-
Not thread safe and therefore no transactions which you only need with multiple threads.
-
Maximum fixed key/value size is ~1350 bytes.
-
Maximum variable length key/value size is 2^31 - 1
-
This is not a database. You could make it into a database or build a database on top of it.
Quick Start
Importing
import (
"github.com/timtadh/fs2/bptree"
"github.com/timtadh/fs2/fmap"
)
Creating a new B+ Tree (fixed key size, variable length value size).
bf, err := fmap.CreateBlockFile("/path/to/file")
if err != nil {
log.Fatal(err)
}
defer bf.Close()
bpt, err := bptree.New(bf, 8, -1)
if err != nil {
log.Fatal(err)
}
// do stuff with bpt
Opening a B+ Tree
bf, err := fmap.OpenBlockFile("/path/to/file")
if err != nil {
log.Fatal(err)
}
defer bf.Close()
bpt, err := bptree.Open(bf)
if err != nil {
log.Fatal(err)
}
// do stuff with bpt
Add a key/value pair. Note, since this is low level you have to serialize your
keys and values. The length of the []byte representing the key must exactly
match the key size of the B+ Tree. You can find out what that was set to by
called bpt.KeySize()
import (
"encoding/binary"
)
var key uint64 = 12
value := "hello world"
kBytes := make([]byte, 8)
binary.PutUvarint(kBytes, key)
err := bpt.Add(kBytes, []byte(value))
if err != nil {
log.Fatal(err)
}
As you can see it can be a little verbose to serialize and de-serialize your keys and values. So be sure to wrap that up in utility functions or even to wrap the interface of the B+ Tree so that client code does not have to think about it.
Since a B+Tree is a "multi-map" meaning there may be more than one value per
key. There is no "Get" method. To retrieve the values associated with a key use
the Find method.
{
var key, value []byte
kvi, err := bpt.Find(kBytes)
if err != nil {
log.Fatal(err)
}
for key, value, err, kvi = kvi(); kvi != nil; key, value, err, kvi = kvi() {
// do stuff with the keys and values
}
if err != nil {
log.Fatal(err)
}
}
That interface is easy to misuse if you do not check the error values as show in the example above. An easier interface is provided for all of the iterators (Range, Find, Keys, Values, Iterate) called the Do interface.
err = bpt.DoFind(kBytes, func(key, value []byte) error {
// do stuff with the keys and values
return nil
})
if err != nil {
log.Fatal(err)
}
It is recommended that you always use the Do* interfaces. The other is provided if the cost of extra method calls is too high.
Removal is also slightly more complicated due to the duplicate keys. This example will remove all key/value pairs associated with the given key:
err = bpt.Remove(kBytes, func(value []byte) bool {
return true
})
if err != nil {
log.Fatal(err)
}
to remove just the one I added earlier do:
err = bpt.Remove(kBytes, func(v []byte) bool {
return bytes.Equal(v, []byte(value))
})
if err != nil {
log.Fatal(err)
}
That wraps up the basic usage. If you want to ensure that the bytes you have written are in fact on disk you have 2 options
-
call bf.Sync() - Note this uses the async mmap interface under the hood. The bytes are not guaranteed to hit the disk after this returns but they will go there soon.
-
call bf.Close()
MMList
A Memory Mapped List. This list works more like a stack and less like a queue.
It is not a good thing to build a job queue on. It is a good thing to build a
large set of items which can be efficiently randomly sampled. It uses the same
varchar system that the B+Tree uses so it can store variably sized items up to
2^31 - 1 bytes long.
Operations
SizeO(1)AppendO(1)PopO(1)GetO(1)SetO(1)SwapO(1)SwapDeleteO(1)
I will consider implementing a Delete method. However, it will be O(n) since
this is implemented a bit like an ArrayList under the hood. The actual way it
works is there is a B+Tree which indexes to list index blocks. The list index
blocks hold pointers (511 of them) to varchar locations. I considered having a
restricted 2 level index but that would have limited the size of the list to a
maximum of ~1 billion items which was uncomfortably small to me. In the future
the implementation may change to use something more like an ISAM index which
will be a bit more compact for this use case.
Quickstart
package main
import (
"binary"
"log"
)
import (
"github.com/timtadh/fs2/fmap"
"github.com/timtadh/fs2/mmlist"
)
func main() {
file, err := fmap.CreateBlockFile("/tmp/file")
if err != nil {
log.Fatal(err)
}
defer file.Close()
list, err := mmlist.New(file)
if err != nil {
log.Fatal(err)
}
idx, err := list.Append([]byte("hello"))
if err != nil {
log.Fatal(err)
}
if d, err := list.Get(idx); err != nil {
log.Fatal(err)
} else if !bytes.Equal([]byte("hello"), d) {
log.Fatal("bytes where not hello")
}
if err := list.Set(idx, "bye!"); err != nil {
log.Fatal(err)
}
if d, err := list.Get(idx); err != nil {
log.Fatal(err)
} else if !bytes.Equal([]byte("bye!"), d) {
log.Fatal("bytes where not bye!")
}
if d, err := list.Pop(); err != nil {
log.Fatal(err)
} else if !bytes.Equal([]byte("bye!"), d) {
log.Fatal("bytes where not bye!")
}
}
fs2-generic
A command to generate type specialized wrappers around fs2 structures.
Since Go does not support generics and is not going to support generics anytime
soon, this program will produce a wrapper specialized to the supplied types. It
is essentially manually implementing type specialized generics in a very limited
form. All fs2 structures operate on sequences of bytes, aka []byte, because
they memory mapped and file backed structures. Therefore, the supplied types
must be serializable to be used as keys and values in an fs2 structure.
How to install
Assuming you already have the code downloaded and in your GOPATH just run:
$ go install github.com/timtadh/fs2/fs2-generic
