Sql2arrow
SQL2Arrow, short for 'SQL to Arrow,' is a Python library that provides convenient and high-performance methods to parse INSERT SQL statements into Arrow arrays. It is particularly useful for analyzing data dumped by mysqldump or other tools.
Install / Use
/learn @goalzz85/Sql2arrowREADME
SQL2Arrow
This is a Python library that provides convenient and high-performance methods to parse INSERT SQL statements into Arrow arrays. It's very useful for analyzing data dumped by mysqldump or other tools.
How to use
Installation
Install the latest SQL2arrow version with:
pip install sql2arrow
Parsing SQL str
import sql2arrow
sql_str = '''
INSERT INTO `region` VALUES
('', '', '2023-01-31 18:00:48', '2023-01-31 18:00:48', ''),
('1541947646568607746', 'region name', '2022-06-29 08:52:21', '2022-06-29 08:52:21', 'D99'),
('1541947680890597378', 'region name1', '2022-06-29 08:52:29', '2022-06-29 08:52:29', 'D98'),
('620422117205', 'region name7', '2021-10-25 18:23:48', '2021-10-25 18:23:48', 'D620422117');
'''
columns = [
sql2arrow.Column("region_code", sql2arrow.ArrowTypes.utf8()),
sql2arrow.Column("region_name", sql2arrow.ArrowTypes.utf8()),
sql2arrow.Column("create_time", sql2arrow.ArrowTypes.utf8()),
sql2arrow.Column("update_time", sql2arrow.ArrowTypes.utf8()),
sql2arrow.Column("parent_region_code", sql2arrow.ArrowTypes.utf8())
]
arrow_data = sql2arrow.parse_sql(sql_str, columns)
Parsing sql files
import sql2arrow
sql_paths = [
"region.sql_0.gz", "region.sql_1.gz","region.sql_2.gz","region.sql_3.gz","region.sql_4.gz","region.sql_5.gz","region.sql_6.gz"
]
columns = [
sql2arrow.Column("region_code", sql2arrow.ArrowTypes.utf8()),
sql2arrow.Column("region_name", sql2arrow.ArrowTypes.utf8()),
sql2arrow.Column("create_time", sql2arrow.ArrowTypes.utf8()),
sql2arrow.Column("update_time", sql2arrow.ArrowTypes.utf8()),
sql2arrow.Column("parent_region_code", sql2arrow.ArrowTypes.utf8())
]
partition_func_spec = sql2arrow.partition.IcebergPartitionFuncSpec()
partition_func_spec.add_partition("region_code", sql2arrow.partition.IcebergTransforms.bucket(30))
it = sql2arrow.SQLFile2ArrowIter(
sql_paths,
columns,
4,
1000,
sql2arrow.CompressionType.SNAPPY,
sql2arrow.Dialect.MYSQL,
partition_func_spec
)
for arr in it:
print(arr)
arro3
SQL2Arrow uses arro3 as the default Python library for Apache Arrow. Thanks to the Arrow PyCapsule Interface, we can seamlessly pass Arro3's Array data to other libraries compatible with the Arrow PyCapsule Interface, including PyArrow, Polars (v1.2+), Pandas (v2.2+), NanoArrow, and more, all with zero-copy memory.
# some codes from above
import pyarrow as pa
tables = [pa.Table.from_arrays(a, names=names) for a in arrs]
Limitations
Dialect
It currently supports only MySQL and PostgreSQL INSERT statements.
Related Skills
himalaya
337.4kCLI to manage emails via IMAP/SMTP. Use `himalaya` to list, read, write, reply, forward, search, and organize emails from the terminal. Supports multiple accounts and message composition with MML (MIME Meta Language).
oracle
337.4kBest practices for using the oracle CLI (prompt + file bundling, engines, sessions, and file attachment patterns).
prose
337.4kOpenProse VM skill pack. Activate on any `prose` command, .prose files, or OpenProse mentions; orchestrates multi-agent workflows.
claude-opus-4-5-migration
83.2kMigrate prompts and code from Claude Sonnet 4.0, Sonnet 4.5, or Opus 4.1 to Opus 4.5
