Dataux
Federated mysql compatible proxy to elasticsearch, mongo, cassandra, big-table, google datastore
Sql Query Proxy to Elasticsearch, Mongo, Kubernetes, BigTable, etc.
Unify disparate data sources and files into a single federated view of your data, and query it with SQL without copying it into a data warehouse.
MySQL-compatible federated query engine for Elasticsearch, Mongo,
Google Datastore, Cassandra, Google BigTable, Kubernetes, and file-based sources.
This query engine hosts a MySQL protocol listener,
which rewrites SQL queries into native requests (elasticsearch, mongo, cassandra, kubernetes-rest-api, bigtable).
It works by implementing a full relational algebra distributed execution engine
to run SQL queries and poly-fill features missing
from the underlying sources. So a backend key-value store such as cassandra
can now have complete WHERE clause support as well as aggregate functions, etc.
Most similar to prestodb, but written in Golang and focused on making it easy to add custom data sources as well as REST API sources.
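To make the poly-fill idea concrete, here is a minimal Go sketch. It is not dataux's actual code: `Row`, `kvScan`, and `countWhere` are hypothetical names chosen for illustration. It assumes a key-value backend that can only return every row, so the engine layer applies the WHERE predicate and the count aggregate itself.

```go
package main

import "fmt"

// Row is a hypothetical decoded record from a backend source.
type Row map[string]interface{}

// kvScan stands in for a key-value backend that can only return every row;
// it has no native WHERE or aggregate support. (Illustrative data only.)
func kvScan() []Row {
	return []Row{
		{"player": "a", "year": 2015, "games": 120},
		{"player": "b", "year": 2016, "games": 98},
		{"player": "c", "year": 2016, "games": 151},
	}
}

// countWhere poly-fills "SELECT count(*) ... WHERE pred" in the engine layer:
// it filters the scanned rows and counts the matches.
func countWhere(rows []Row, pred func(Row) bool) int {
	n := 0
	for _, r := range rows {
		if pred(r) {
			n++
		}
	}
	return n
}

func main() {
	// Equivalent in spirit to: SELECT count(*) FROM t WHERE year = 2016
	n := countWhere(kvScan(), func(r Row) bool { return r["year"] == 2016 })
	fmt.Println(n)
}
```

A real engine would push down whatever part of the predicate the backend can evaluate natively and only poly-fill the remainder; this sketch poly-fills everything for simplicity.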
Storage Sources
- Google Big Table SQL against Bigtable.
- Elasticsearch Simplify access to Elasticsearch.
- Mongo Translate SQL into mongo.
- Google Cloud Storage / (csv, json files) An example of a REST API backend (the list of files); the file contents themselves are also tables.
- Cassandra SQL against cassandra. Adds sql features that are missing.
- Lytics SQL against Lytics REST APIs.
- Kubernetes An example of REST api backend.
- Google Big Query MYSQL against the world's best analytics data warehouse, BigQuery.
- Google Datastore MYSQL against Datastore.
Features
- Distributed run queries across multiple servers
- Hackable Sources Very easy to add a new Source for your custom data, files, json, csv, storage.
- Hackable Functions Add custom go functions to extend the sql language.
- Joins Get join functionality between heterogeneous sources.
- Frontends Currently only the MySQL protocol is supported; RethinkDB (for a real-time API) is planned. Frontends are pluggable.
- Backends Elasticsearch, Google-Datastore, Mongo, Cassandra, BigTable, Kubernetes currently implemented. Csv, Json files, and custom formats (protobuf) are in progress.
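As a rough illustration of what joining heterogeneous sources involves, the Go sketch below runs an inner hash join over rows that are assumed to have already been fetched from two different backends. `Record` and `hashJoin` are hypothetical names; this is not dataux's implementation, only one common way such a join can be done in the engine layer.

```go
package main

import "fmt"

// Record is a hypothetical row shape shared by both sources.
type Record map[string]string

// hashJoin performs an inner equi-join of left and right on the given key.
// It builds a hash index over the right side, then probes it with each left row.
func hashJoin(left, right []Record, key string) []Record {
	index := make(map[string][]Record)
	for _, r := range right {
		index[r[key]] = append(index[r[key]], r)
	}
	var out []Record
	for _, l := range left {
		for _, r := range index[l[key]] {
			merged := Record{}
			for k, v := range r {
				merged[k] = v
			}
			for k, v := range l {
				merged[k] = v // left values win on column-name clashes
			}
			out = append(out, merged)
		}
	}
	return out
}

func main() {
	// Pretend users come from one backend (say, a document store) ...
	users := []Record{{"user_id": "1", "name": "ann"}, {"user_id": "2", "name": "bob"}}
	// ... and orders from another (say, a CSV file).
	orders := []Record{{"user_id": "1", "item": "bat"}, {"user_id": "1", "item": "glove"}}
	for _, row := range hashJoin(orders, users, "user_id") {
		fmt.Println(row["name"], row["item"])
	}
}
```

Because the join happens after rows are decoded into a common shape, neither backend needs to know about the other.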
Status
- NOT Production ready. Currently supporting a few non-critical use-cases (ad-hoc queries, support tool) in production.
Try it Out
In these examples we will:
- Create a CSV database of baseball data from http://seanlahman.com/baseball-archive/statistics/
- Connect to Google BigQuery public datasets (you will need a project, but the free quota will probably keep it free).
# download files to local /tmp
mkdir -p /tmp/baseball
cd /tmp/baseball
curl -Ls http://seanlahman.com/files/database/baseballdatabank-2017.1.zip > bball.zip
unzip bball.zip
mv baseball*/core/*.csv .
rm bball.zip
rm -rf baseballdatabank-*
# run a docker container locally
docker run -e "LOGGING=debug" --rm -it -p 4000:4000 \
-v /tmp/baseball:/tmp/baseball \
gcr.io/dataux-io/dataux:latest
In another console, open the mysql client:
# connect to the docker container you just started
mysql -h 127.0.0.1 -P4000
-- Now create a new Source
CREATE source baseball WITH {
"type":"cloudstore",
"schema":"baseball",
"settings" : {
"type": "localfs",
"format": "csv",
"path": "baseball/",
"localpath": "/tmp"
}
};
show databases;
use baseball;
show tables;
describe appearances;
select count(*) from appearances;
select * from appearances limit 10;
Big Query Example
# assuming you are running locally; if you are instead in Google Cloud or Google Container Engine,
# you don't need the credentials or volume mount
docker run -e "GOOGLE_APPLICATION_CREDENTIALS=/.config/gcloud/application_default_credentials.json" \
-e "LOGGING=debug" \
--rm -it \
-p 4000:4000 \
-v ~/.config/gcloud:/.config/gcloud \
gcr.io/dataux-io/dataux:latest
# now that dataux is running use mysql-client to connect
mysql -h 127.0.0.1 -P 4000
Now run some queries:
-- add a bigquery datasource
CREATE source `datauxtest` WITH {
"type":"bigquery",
"schema":"bqsf_bikes",
"table_aliases" : {
"bikeshare_stations" : "bigquery-public-data:san_francisco.bikeshare_stations"
},
"settings" : {
"billing_project" : "your-google-cloud-project",
"data_project" : "bigquery-public-data",
"dataset" : "san_francisco"
}
};
use bqsf_bikes;
show tables;
describe film_locations;
select * from film_locations limit 10;
Hacking
For now, the goal is to allow this to be used as a library, so the
vendor directory is not checked in. Use docker containers or dep for now.
# run dep ensure
dep ensure -v
Related Projects, Database Proxies & Multi-Data QL
- Data-Accessability Making it easier to query, access, share, and use data. Protocol shifting (for accessibility). Sharing/Replication between db types.
- Scalability/Sharding Implement sharding, connection sharing
Name | Scaling | Ease Of Access (sql, etc) | Comments
---- | ------- | ----------------------------- | ---------
Vitess | Y | | for scaling (sharding), very mature
twemproxy | Y | | for scaling memcache
Couchbase N1QL | Y | Y | sql interface to couchbase k/v (and full-text-index)
prestodb | | Y | query front end to multiple backends, distributed
cratedb | Y | Y | all-in-one db, not a proxy, sql to es
codis | Y | | for scaling redis
MariaDB MaxScale | Y | | for scaling mysql/mariadb (sharding) mature
Netflix Dynomite | Y | | not really sql, just multi-store k/v
redishappy | Y | | for scaling redis, haproxy
mixer | Y | | simple mysql sharding
We use more and more databases, flat files, message queues, etc. For databases, the primary reader/writer is fine, but secondary readers, such as people investigating ad-hoc issues, end up accessing and learning many different query languages.
Credit to mixer: the mysql connection pieces are derived from it (which was forked from vitess).
Inspiration/Other works
In Internet architectures, data systems are typically categorized into source-of-truth systems that serve as primary stores for the user-generated writes, and derived data stores or indexes which serve reads and other complex queries. The data in these secondary stores is often derived from the primary data through custom transformations, sometimes involving complex processing driven by business logic. Similarly data in caching tiers is derived from reads against the primary data store, but needs to get invalidated or refreshed when the primary data gets mutated. A fundamental requirement emerging from these kinds of data architectures is the need to reliably capture, flow and process primary data changes.
from Databus
Building
I plan on getting the vendor directory checked in soon so the build will work. However,
I am currently trying to figure out how to organize packages to allow use as both a library
and a daemon. (See how minimal main.go is, to encourage your own builtins and datasources.)
# for just docker
# ensure /vendor has correct versions
dep ensure -update
# build binary
./.build
# build docker
docker build -t gcr.io/dataux-io/dataux:v0.15.1 .
