Pgspider
High-Performance SQL Cluster Engine for Scalable Data Virtualization
Install / Use
/learn @pgspider/PgspiderREADME
PGSpider
PGSpider is High-Performance SQL Cluster Engine for distributed big data.
PGSpider can access a number of data sources using Foreign Data Wrapper(FDW) and retrieves the distributed data source vertically.
Usage of PGSpider is the same as PostgreSQL except its program name is pgspider and default port number is 4813. You can use any client applications such as libpq and psql.
Features
-
Multi-Tenant
User can get records in multi tables by one SQL easily.
If there are tables with similar schema in each data source, PGSpider can view them as a single virtual table: We call it as Multi-Tenant table. -
Modification
User can modify data at Multi-Tenant table by using INSERT/UPDATE/DELETE query.
For INSERT feature, PGSpider will use round robin method to choose 1 alive node that supports INSERT feature and is the next to the previous target as rotation to INSERT data.
For UPDATE/DELETE feature, PGSpider will execute UPDATE/DELETE at all alive nodes that support UPDATE/DELETE feature.
PGSpider supports both Direct and Foreign Modification.
PGSpider supports bulk INSERT by using batch_size option.- If user specifies batch size option, we can get batch size from foreign table or foreign server option.
- If batch_size is 1, tell child nodes to execute simple insert. Otherwise, execute batch insert if child node can do.
- If batch_size is not specified by user on multi-tenant table, automatically calculation based on batch size of child tables and the number of child nodes using LCM method (Least Common Multiple).
If batch size is too large, we use the limited value (6553500) as batch size.
PGSpider distributes records to data sources evenly, not only for one query but also for many queries.
-
Parallel processing
PGSpider executes queries and fetches results from child nodes in parallel.
PGSpider expands Multi-Tenant table to child tables, creates new threads for each child table to access corresponding data source. -
Pushdown
WHERE clause and aggregation functions are pushed down to child nodes.
Pushdown to Multi-tenant tables occur error when AVG, STDDEV and VARIANCE are used.
PGSPider improves this error, PGSpider can execute them. -
Data Compression Transfer PGSpider support transferring data to other datasource via Cloud Function.
Data will be compressed, transmitted to Cloud Function, and then transfered to data source.
This feature helps PGSpider control and reduce the size of transferred data between PGSpider and destination data source, lead to reduce the usage fee on cloud service
How to build PGSpider
Clone PGSpider source code.
<pre> git clone https://github.com/pgspider/pgspider.git </pre>Build and install PGSpider and extensions.
<pre> cd pgspider ./configure make sudo make install cd contrib/pgspider_core_fdw make sudo make install cd ../pgspider_fdw make sudo make install </pre>Default install directory is /usr/local/pgspider.
Usage
For example, we will create 2 different child nodes, SQLite and PostgreSQL. They are accessed by PGSpider as root node. Please install SQLite and PostgreSQL for child nodes.
After that, we install PostgreSQL FDW and SQLite FDW into PGSpider.
Install SQLite FDW
<pre> cd ../ git clone https://github.com/pgspider/sqlite_fdw.git cd sqlite_fdw make sudo make install </pre>Install PostgreSQL FDW
<pre> cd ../postgres_fdw make sudo make install </pre>Start PGSpider
PGSpider binary name is same as PostgreSQL.
Default install directory is changed.
Create database cluster and start server.
<pre> cd /usr/local/pgspider/bin ./initdb -D ~/pgspider_db ./pg_ctl -D ~/pgspider_db start ./createdb pgspider </pre>Connect to PGSpider.
<pre> ./psql pgspider </pre>Load extension
PGSpider (Parent node)
<pre> CREATE EXTENSION pgspider_core_fdw; </pre>PostgreSQL, SQLite (Child node)
<pre> CREATE EXTENSION postgres_fdw; CREATE EXTENSION sqlite_fdw; </pre>Create server
PGSpider (Parent node)
<pre> CREATE SERVER parent FOREIGN DATA WRAPPER pgspider_core_fdw; </pre>PostgreSQL, SQLite (Child node)
In this example, child PostgreSQL node is localhost and port is 5432.
SQLite node's database is /tmp/temp.db.
Create user mapping
PostgreSQL (Child node)
<pre> CREATE USER MAPPING FOR CURRENT_USER SERVER postgres_svr OPTIONS(user 'user', password 'pass'); </pre>SQLite (Child node)
No need to create user mapping.
Create Multi-Tenant table
PGSpider (Parent node)
You need to declare a column named "__spd_url" on parent table.
This column is node location in PGSpider. It allows you to know where the data is comming from node.
In this example, we define 't1' table to get data from PostgreSQL node and SQLite node.
When expanding Multi-Tenant table to data source tables, PGSpider searches child node tables by name having [Multi-Tenant table name]__[data source name]__0.
PostgreSQL, SQLite (Child node)
<pre> CREATE FOREIGN TABLE t1__postgres_svr__0(i int, t text) SERVER postgres_svr OPTIONS (table_name 't1'); CREATE FOREIGN TABLE t1__sqlite_svr__0(i int OPTIONS (key 'true'), t text) SERVER sqlite_svr OPTIONS (table 't1'); </pre>Access Multi-Tenant table
<pre> SELECT * FROM t1; i | t | __spd_url ----+-----+---------------- 1 | aaa | /sqlite_svr/ 2 | bbb | /sqlite_svr/ 10 | a | /postgres_svr/ 11 | b | /postgres_svr/ (4 rows) </pre>Access Multi-Tenant table using node filter
You can choose getting node with 'IN' clause after FROM items (Table name).
<pre> SELECT * FROM t1 IN ('/postgres_svr/'); i | t | __spd_url ----+---+---------------- 10 | a | /postgres_svr/ 11 | b | /postgres_svr/ (2 rows) </pre>Modify Multi-Tenant table
<pre> SELECT * FROM t1; i | t | __spd_url ----+-----+---------------- 1 | aaa | /sqlite_svr/ 10 | a | /postgres_svr/ 2 | bbb | /sqlite_svr/ 11 | b | /postgres_svr/ (4 rows) INSERT INTO t1 VALUES (4, 'c'); INSERT 0 1 SELECT * FROM t1; i | t | __spd_url ----+-----+---------------- 10 | a | /postgres_svr/ 11 | b | /postgres_svr/ 1 | aaa | /sqlite_svr/ 2 | bbb | /sqlite_svr/ 4 | c | /postgres_svr/ (5 rows) UPDATE t1 SET t = 'nn'; UPDATE 5 SELECT * FROM t1; i | t | __spd_url ----+----+---------------- 10 | nn | /postgres_svr/ 11 | nn | /postgres_svr/ 1 | nn | /sqlite_svr/ 2 | nn | /sqlite_svr/ 4 | nn | /postgres_svr/ (5 rows) DELETE FROM t1; DELETE 5 SELECT * FROM t1; i | t | __spd_url ---+---+----------- (0 rows) </pre>Modify Multi-Tenant table using node filter
You can choose modifying node with 'IN' clause after table name.
<pre> INSERT INTO t1 VALUES (1, 'aaa'), (11, 'b'); SELECT * FROM t1; i | t | __spd_url ----+-----+---------------- 1 | aaa | /sqlite_svr/ 11 | b | /postgres_svr/ (2 rows) INSERT INTO t1 IN ('/postgres_svr/') VALUES (4, 'c'); SELECT * FROM t1; i | t | __spd_url ----+-----+---------------- 1 | aaa | /sqlite_svr/ 4 | c | /postgres_svr/ 11 | b | /postgres_svr/ (3 rows) UPDATE t1 IN ('/postgres_svr/') SET t = 'xxx'; UPDATE 2 SELECT * FROM t1; i | t | __spd_url ----+-----+---------------- 11 | xxx | /postgres_svr/ 1 | aaa | /sqlite_svr/ 4 | xxx | /postgres_svr/ (3 rows) DELETE FROM t1 IN ('/sqlite_svr/'); DELETE 1 SELECT * FROM t1; i | t | __spd_url ----+-----+---------------- 11 | xxx | /postgres_svr/ 4 | xxx | /postgres_svr/ (2 rows) </pre>Tree Structure
PGSpider can get data from child PGSpider, it means PGSpider can create tree structure.
For example, we will create a new PGSpider as root node which connects to PGSpider of previous example.
The new root node is parent of previous PGSpider node.
Start new root PGSpider
Create new database cluster with initdb and change port number.
After that, start and connect to new root node.
Load extension
PGSpider (new root node)
If child node is PGSpider, PGSpider use pgspider_fdw.
Create server
PGSpider (new root node)
<pre> CREATE SERVER new_root FOREIGN DATA WRAPPER pgspider_core_fdw; </pre>PGSpider (Parent node)
<pre> CREATE SERVER parent FOREIGN DATA WRAPPER pgspider_svr OPTIONS (host '127.0.0.1', port '4813') ; </pre>Create user mapping
PGSpider (Parent node)
<pre> CREATE USER MAPPING FOR CURRENT_USER SERVER parent OPTIONS(user 'user', password 'pass'); </pre>Create Multi-Tenant table
PGSpider (new root node)
<pre> CREATE FOREIGN TABLE t1(i int, t text, __spd_url text) SERVER new_root; </pre>PGSpider (Parent node)
<pre> CREATE FOREIGN TABLE t1__parent__0(i int, t text, __spd_url text) SERVER parent; </pre>Access Multi-Tenant table
<pre> SELECT * FROM t1; i | t | __spd_url ----+-----+----------------------- 1 | aaa | /parent/sqlite_svr/ 2 | bbb | /parent/sqlite_svr/ 10 | a | /parent/postgres_svr/ 11 | b | /parent/postgres_svr/ (4 rows) </pre>Create/Drop datasource table
According to the information of a foreign table, you can create/drop a table on remote database.
- The query syntax: <pre> CREATE DATASOURCE TABLE [ IF NOT EXISTS ] table_name; DROP DATASOURCE TABLE [ IF EXISTS ] table_name; </pre>
- Parameters:
- IF NOT EXISTS (in CREATE DATASOURCE TABLE)
Do not throw any error if a relation/table with the same name with datasource table already exists in remote server. Note that there is no guarantee that the existing datasouce table is anything like the one that would have been created. - IF EXISTS (in DROP DATASOURCE TABLE)
Do no
- IF NOT EXISTS (in CREATE DATASOURCE TABLE)
