Slob
Data store for Aard 2
* Slob
Slob (sorted list of blobs) is a read-only, compressed data store with a dictionary-like interface for looking up content by text keys. Keys are sorted according to the [[http://www.unicode.org/reports/tr10/][Unicode Collation Algorithm]], which makes punctuation-, case- and diacritics-insensitive lookups possible. /slob.py/ is a reference implementation of a slob format reader and writer in [[http://python.org][Python 3]].
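To give a feel for what "punctuation, case and diacritics insensitive" means in practice, here is a rough, standard-library-only sketch. It is *not* the Unicode Collation Algorithm and not what /slob.py/ does internally (slob relies on ICU for collation); the helper name /loose_key/ is made up for this illustration.

#+BEGIN_SRC python
import unicodedata


def loose_key(text):
    """Approximate a punctuation-, case- and diacritics-insensitive key.

    Hypothetical helper for illustration only; slob itself uses ICU collation.
    """
    # Decompose so that accents become separate combining characters.
    decomposed = unicodedata.normalize('NFD', text)
    kept = []
    for ch in decomposed:
        category = unicodedata.category(ch)
        # Drop combining marks (Mn) and all punctuation categories (P*).
        if category == 'Mn' or category.startswith('P'):
            continue
        kept.append(ch)
    # casefold() is a more aggressive lower() intended for caseless matching.
    return ''.join(kept).casefold()


for word in ("ABC's", 'ABCs', 'café', 'Cafe'):
    print(word, '->', loose_key(word))
#+END_SRC

With this approximation /ABC's/ and /ABCs/ map to the same key, as do /café/ and /Cafe/. A real collator handles many more cases (locale rules, ignorable characters, multi-level comparison) than this sketch does.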
** Installation
/slob.py/ depends on the following components:
- [[http://python.org][Python]] >= 3.6
- [[http://icu-project.org][ICU]] >= 4.8
- [[https://pypi.python.org/pypi/PyICU][PyICU]] >= 1.5
In addition, the following components are needed to set up the slob environment:
- [[http://git-scm.com/][git]]
- [[https://virtualenv.pypa.io/][virtualenv]]
Consult your operating system documentation and these components' websites for installation instructions.
For example, on Ubuntu 20.04, the following command installs required packages:
#+BEGIN_SRC sh
sudo apt update
sudo apt install python3 python3-icu python3.8-venv git
#+END_SRC
Create a new Python virtual environment:
#+BEGIN_SRC sh
python3 -m venv env-slob --system-site-packages
#+END_SRC
Activate it:
#+BEGIN_SRC sh
source env-slob/bin/activate
#+END_SRC
Install from the source code repository:
#+BEGIN_SRC sh
pip install git+https://github.com/itkach/slob.git
#+END_SRC
or download the source code manually:
#+BEGIN_SRC sh
wget https://github.com/itkach/slob/archive/master.zip
pip install master.zip
#+END_SRC
Run tests:
#+BEGIN_SRC sh
python -m unittest slob
#+END_SRC
** Command line interface
/slob.py/ provides a basic command line interface to inspect and modify slob content.
#+BEGIN_SRC
usage: slob [-h] {find,get,info,tag} ...

positional arguments:
  {find,get,info,tag}  sub-command
    find               Find keys
    get                Retrieve blob content
    info               Inspect slob and print basic information about it
    tag                List tags, view or edit tag value
    convert            Create new slob with the same convent but different
                       encoding and compression parameters
                       or split into multiple slobs

optional arguments:
  -h, --help           show this help message and exit
#+END_SRC
To see basic slob info such as text encoding, compression and tags:
#+BEGIN_SRC
slob info my.slob
#+END_SRC
To see the value of a tag, for example /label/:
#+BEGIN_SRC
slob tag -n label my.slob
#+END_SRC
To set a tag value:
#+BEGIN_SRC
slob tag -n label -v "A Fine Dictionary" my.slob
#+END_SRC
To look up a key, for example /abc/:
#+BEGIN_SRC
slob find wordnet-3.0.slob abc
#+END_SRC
The output should look something like
#+BEGIN_SRC
465 text/html; charset=utf-8 ABC
466 text/html; charset=utf-8 abcoulomb
472 text/html; charset=utf-8 ABC's
468 text/html; charset=utf-8 ABCs
#+END_SRC
The first column in the output is the blob id. It can be used to retrieve blob content (content bytes are written to stdout):
#+BEGIN_SRC
slob get wordnet-3.0.slob 465
#+END_SRC
To re-encode or re-compress slob content with different parameters:
#+BEGIN_SRC
slob convert -c lzma2 -b 256 simplewiki-20140209.zlib.384k.slob simplewiki-20140209.lzma2.256k.slob
#+END_SRC
To split into multiple slobs:
#+BEGIN_SRC
slob convert --split 4096 enwiki-20150406.slob enwiki-20150406-vol.slob
#+END_SRC
Output name /enwiki-20150406-vol.slob/ is the name of the directory where resulting .slob files will be created.
This is useful for crippled systems that can't use normal filesystems and have file size limits, such as SD cards on vanilla Android. Note that this command doesn't duplicate any content, so clients must search all these slobs when looking for shared resources such as stylesheets, fonts, javascript or images.
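Because shared resources may live in any one of the split volumes, a client has to try each volume in turn. A minimal sketch of that idea follows; plain dicts stand in for the per-volume lookups, and the function name /find_in_volumes/ is hypothetical. Adapting it to real slob readers is left as an assumption, since how a reader signals a missing key is not covered here.

#+BEGIN_SRC python
def find_in_volumes(volumes, key):
    """Return (volume_index, value) from the first volume that has `key`.

    `volumes` is any sequence of dict-like objects (an assumption made for
    this sketch; it is not part of the slob API).
    """
    for index, volume in enumerate(volumes):
        try:
            return index, volume[key]
        except KeyError:
            # Not in this volume, try the next one.
            continue
    raise KeyError(key)


# Plain dicts stand in for two volumes produced by a split.
vol1 = {'article-a': b'<html>A</html>'}
vol2 = {'article-b': b'<html>B</html>', 'style.css': b'body {}'}

print(find_in_volumes([vol1, vol2], 'style.css'))
#+END_SRC

Here the stylesheet is only present in the second volume, so a lookup that stopped at the volume containing the requested article would miss it.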
** Examples
*** Basic Usage
Create a slob:
#+BEGIN_SRC python
import slob

with slob.create('test.slob') as w:
    w.add(b'Hello A', 'a')
    w.add(b'Hello B', 'b')
#+END_SRC
Read content:
#+BEGIN_SRC python
import slob

with slob.open('test.slob') as r:
    d = r.as_dict()
    for key in ('a', 'b'):
        result = next(d[key])
        print(result.content)
#+END_SRC
will print
#+BEGIN_SRC
b'Hello A'
b'Hello B'
#+END_SRC
The slob we created in this example certainly works, but it is not
ideal: we neglected to specify a content type for the content we
are adding. Let's consider a slightly more involved example:
#+BEGIN_SRC python
import slob

PLAIN_TEXT = 'text/plain; charset=utf-8'

with slob.create('test1.slob') as w:
    w.add('Hello, Earth!'.encode('utf-8'),
          'earth', 'terra', content_type=PLAIN_TEXT)
    w.add_alias('земля', 'earth')
    w.add('Hello, Mars!'.encode('utf-8'), 'mars',
          content_type=PLAIN_TEXT)
#+END_SRC
Here we specify the MIME type of the content we are adding so that
consumers of this content can display or process it
properly. Note that the same content may be associated with
multiple keys, either when it is added or later with /add_alias/.
This
#+BEGIN_SRC python
import slob

with slob.open('test1.slob') as r:

    def p(blob):
        # Print blob id, content type and raw content bytes.
        print(blob.id, blob.content_type, blob.content)

    for key in ('earth', 'земля', 'terra'):
        blob = next(r.as_dict()[key])
        p(blob)

    p(next(r.as_dict()['mars']))
#+END_SRC
will print
#+BEGIN_SRC
0 text/plain; charset=utf-8 b'Hello, Earth!'
0 text/plain; charset=utf-8 b'Hello, Earth!'
0 text/plain; charset=utf-8 b'Hello, Earth!'
1 text/plain; charset=utf-8 b'Hello, Mars!'
#+END_SRC
Note that the blob id for the first three keys is the same: they all
point to the same content item.
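The relationship between keys, aliases and blob ids can be pictured with a toy in-memory model. This is only a sketch of the idea (several sorted keys referring to one stored blob), not how /slob.py/ is implemented; the class /ToyStore/ and its attributes are invented for illustration, and keys are sorted by code point rather than by Unicode collation.

#+BEGIN_SRC python
import bisect


class ToyStore:
    """Toy model: a list of blobs plus a sorted list of (key, blob_id) refs."""

    def __init__(self):
        self.blobs = []  # blob id is simply the index into this list
        self.refs = []   # (key, blob_id) pairs, kept sorted by key

    def add(self, content, *keys):
        blob_id = len(self.blobs)
        self.blobs.append(content)
        for key in keys:
            bisect.insort(self.refs, (key, blob_id))
        return blob_id

    def add_alias(self, alias, target_key):
        # An alias is just another key pointing at the target's blob id.
        blob_id, _ = self.lookup(target_key)
        bisect.insort(self.refs, (alias, blob_id))

    def lookup(self, key):
        # (key,) sorts before any (key, blob_id), so this finds the first match.
        i = bisect.bisect_left(self.refs, (key,))
        if i == len(self.refs) or self.refs[i][0] != key:
            raise KeyError(key)
        blob_id = self.refs[i][1]
        return blob_id, self.blobs[blob_id]


store = ToyStore()
store.add(b'Hello, Earth!', 'earth', 'terra')
store.add_alias('земля', 'earth')
store.add(b'Hello, Mars!', 'mars')

for key in ('earth', 'земля', 'terra', 'mars'):
    print(key, store.lookup(key))
#+END_SRC

In this model the three Earth keys all resolve to blob id 0 and /mars/ resolves to blob id 1, mirroring the output shown above.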
Take a look at tests in /slob.py/ for more examples.
*** Software and Dictionaries
- [[https://github.com/itkach/slob/wiki/Dictionaries][Wikipedia, Wiktionary, WordNet, FreeDict and more]]
- [[http://github.com/itkach/aard2-android/][aard2-android]] - dictionary for Android
- [[https://github.com/farfromrefug/OSS-Dict][OSS-Dict]] - fork of Aard2 with new Material design and updated features
- [[https://github.com/itkach/aard2-web][aard2-web]] - minimalistic Web UI (Java)
- [[https://github.com/goldendict/goldendict][GoldenDict]] - Feature-rich dictionary lookup program supporting multiple dictionary formats including slob
- [[https://github.com/xiaoyifang/goldendict-ng][GoldenDict-ng]] - The Next Generation GoldenDict
- [[http://github.com/itkach/slobber/][slobber]] - Web API to look up content in slob dictionaries
- [[http://github.com/itkach/slobby/][slobby]] - minimalistic Web UI (Python)
- [[https://github.com/MuntashirAkon/SlobDict][SlobDict]] - modern, lightweight GTK 4 dictionary app for Linux (Python)
- [[https://github.com/ilius/pyglossary][pyglossary]] - convert dictionaries in various formats, including slob
- [[https://github.com/itkach/mw2slob][mw2slob]] - create slob dictionaries from Wikimedia Enterprise HTML Dumps or MediaWiki API
- [[http://github.com/itkach/xdxf2slob/][xdxf2slob]] - create slob dictionaries from XDXF
- [[https://github.com/itkach/tei2slob/][tei2slob]] - create slob dictionaries from TEI
- [[http://github.com/itkach/wordnet2slob/][wordnet2slob]] - convert WordNet database to slob dictionary
** Slob File Format
*** Slob
| Element     | Type                            | Description                                                                                                                      |
|-------------+---------------------------------+----------------------------------------------------------------------------------------------------------------------------------|
| magic       | fixed size sequence of 8 bytes  | Bytes ~21 2d 31 53 4c 4f 42 1f~: string ~!-1SLOB~ followed by ascii unit separator (ascii hex code ~1f~) identifying slob format |
|-------------+---------------------------------+----------------------------------------------------------------------------------------------------------------------------------|
| uuid        | fixed size sequence of 16 bytes | Unique slob identifier ([[https://tools.ietf.org/html/rfc4122][RFC 4122]] UUID)                                                  |
|-------------+---------------------------------+----------------------------------------------------------------------------------------------------------------------------------|
| encoding    | tiny text (utf8)                | Name of text encoding used for all other text elements: tag names and values, content types, keys, fragments                     |
|-------------+---------------------------------+----------------------------------------------------------------------------------------------------------------------------------|
| compression | tiny text                       | Name of compression algorithm used to compress storage bins.                                                                     |
|             |                                 | slob.py understands following names: /bz2/,
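
The first two header elements have fixed sizes, so they can be checked without knowing anything else about the format. The sketch below reads the 8 magic bytes and the 16 UUID bytes from a binary stream. The function name /read_magic_and_uuid/ is hypothetical, and interpreting the 16 bytes with ~uuid.UUID(bytes=...)~ (big-endian field order) is an assumption; the remaining elements are not parsed because their encodings are not described in this excerpt.

#+BEGIN_SRC python
import io
import uuid

# Magic from the table above: "!-1SLOB" followed by the unit separator 0x1f.
MAGIC = bytes.fromhex('212d31534c4f421f')


def read_magic_and_uuid(stream):
    """Validate the magic bytes and return the slob UUID (illustrative only)."""
    magic = stream.read(8)
    if magic != MAGIC:
        raise ValueError('not a slob file: unexpected magic %r' % magic)
    raw = stream.read(16)
    if len(raw) != 16:
        raise ValueError('truncated header: expected 16 UUID bytes')
    # Assumption: the bytes are stored in the order uuid.UUID(bytes=...) expects.
    return uuid.UUID(bytes=raw)


# A fabricated in-memory header stands in for a real file here.
sample_id = uuid.UUID('12345678-1234-5678-1234-567812345678')
fake_header = io.BytesIO(MAGIC + sample_id.bytes)
print(read_magic_and_uuid(fake_header))
#+END_SRC

For a real file the same function could be called on a file object opened in binary mode, for example ~open('my.slob', 'rb')~.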
