Iterable Data

Iterable Data is a Python library for reading and writing data files row by row through a consistent, iterator-based interface. It provides a unified API for working with many data formats (CSV, JSON, Parquet, XML, etc.), similar to csv.DictReader but supporting far more formats.

The library simplifies data processing and conversion between formats while preserving complex nested data structures (unlike pandas DataFrames, which require flattening).
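The nested-data point above can be illustrated with the standard library alone: csv.DictReader yields flat, string-valued rows, while a JSON Lines reader preserves nested structures as Python dictionaries. (This sketch shows the behaviour the library generalizes; it does not use iterabledata itself.)

```python
import csv
import io
import json

csv_text = "id,tags\n1,\"a;b\"\n"
jsonl_text = '{"id": 1, "tags": ["a", "b"], "meta": {"lang": "en"}}\n'

# CSV: every value is a flat string; nesting must be encoded by hand.
csv_row = next(csv.DictReader(io.StringIO(csv_text)))
print(csv_row["tags"])            # "a;b" -- just a string

# JSON Lines: nested lists and dicts survive the round trip.
jsonl_row = json.loads(jsonl_text.splitlines()[0])
print(jsonl_row["meta"]["lang"])  # "en" -- a real nested dict
```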

Features

  • Unified API: Single interface for reading/writing multiple data formats
  • Automatic Format Detection: Detects file type and compression from filename or content (magic numbers and heuristics)
  • Format Capability Reporting: Programmatically query format capabilities (read/write/bulk/totals/streaming/tables)
  • Support for Compression: Works seamlessly with compressed files
  • Preserves Nested Data: Handles complex nested structures as Python dictionaries
  • DuckDB Integration: Optional DuckDB engine for high-performance queries with pushdown optimizations
  • Pipeline Processing: Built-in pipeline support for data transformation
  • Encoding Detection: Automatic encoding and delimiter detection for text files
  • Bulk Operations: Efficient batch reading and writing
  • Table Listing: Discover available tables, sheets, and datasets in multi-table formats
  • Context Manager Support: Use with statements for automatic resource cleanup
  • DataFrame Bridges: Convert iterable data to Pandas, Polars, and Dask DataFrames with one-liner methods
  • Cloud Storage Support: Direct access to S3, GCS, and Azure Blob Storage via URI schemes
  • Database Engine Support: Read-only access to SQL and NoSQL databases (PostgreSQL, ClickHouse, MySQL, MongoDB, Elasticsearch, etc.) as iterable data sources
  • Atomic Writes: Production-safe file writing with temporary files and atomic renames
  • Bulk File Conversion: Convert multiple files at once using glob patterns or directories
  • Progress Tracking and Metrics: Built-in progress bars, callbacks, and structured metrics objects
  • Error Handling Controls: Configurable error policies and structured error logging
  • Type Hints and Type Safety: Complete type annotations with typed helper functions for dataclasses and Pydantic models
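The row-by-row conversion and pipeline model behind several of the features above can be sketched with the standard library: read records one at a time, optionally transform each, and write them out in another format, without loading the whole file into memory. (convert_csv_to_ndjson is a hypothetical helper for illustration, not iterabledata API.)

```python
import csv
import json

def convert_csv_to_ndjson(src_path, dest_path, transform=None):
    """Stream a CSV file into NDJSON, one record at a time."""
    with open(src_path, newline="", encoding="utf-8") as src, \
         open(dest_path, "w", encoding="utf-8") as dest:
        for row in csv.DictReader(src):   # one record at a time
            if transform is not None:
                row = transform(row)      # optional pipeline step
            dest.write(json.dumps(row) + "\n")
```

Because only one row is in memory at a time, the same pattern scales from kilobytes to files larger than RAM.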

Supported File Types

Core Formats

  • JSON - Standard JSON files
  • JSONL/NDJSON - JSON Lines format (one JSON object per line)
  • JSON-LD - JSON for Linking Data (RDF format)
  • CSV/TSV - Comma and tab-separated values
  • Annotated CSV - CSV with type annotations and metadata
  • CSVW - CSV on the Web (with metadata)
  • PSV/SSV - Pipe and semicolon-separated values
  • LTSV - Labeled Tab-Separated Values
  • FWF - Fixed Width Format
  • XML - XML files with configurable tag parsing
  • ZIP XML - XML files within ZIP archives
  • HTML - HTML files with table extraction

Binary Formats

  • BSON - Binary JSON format
  • MessagePack - Efficient binary serialization
  • CBOR - Concise Binary Object Representation
  • UBJSON - Universal Binary JSON
  • SMILE - Binary JSON variant
  • Bencode - BitTorrent encoding format
  • Avro - Apache Avro binary format
  • Pickle - Python pickle format

Columnar & Analytics Formats

  • Parquet - Apache Parquet columnar format
  • ORC - Optimized Row Columnar format
  • Arrow/Feather - Apache Arrow columnar format
  • Lance - Modern columnar format optimized for ML and vector search
  • Vortex - Modern columnar format with fast random access
  • Delta Lake - Delta Lake format
  • Iceberg - Apache Iceberg format
  • Hudi - Apache Hudi format

Database Formats

  • SQLite - SQLite database files
  • DBF - dBase/FoxPro database files
  • MySQL Dump - MySQL dump files
  • PostgreSQL Copy - PostgreSQL COPY format
  • DuckDB - DuckDB database files

Statistical Formats

  • SAS - SAS data files
  • Stata - Stata data files
  • SPSS - SPSS data files
  • R Data - R RDS and RData files
  • PX - PC-Axis format
  • ARFF - Attribute-Relation File Format (Weka format)

Scientific Formats

  • NetCDF - Network Common Data Form for scientific data
  • HDF5 - Hierarchical Data Format

Geospatial Formats

  • GeoJSON - Geographic JSON format
  • GeoPackage - OGC GeoPackage format
  • GML - Geography Markup Language
  • KML - Keyhole Markup Language
  • Shapefile - ESRI Shapefile format
  • MVT/PBF - Mapbox Vector Tiles
  • TopoJSON - Topology-preserving GeoJSON extension

RDF & Semantic Formats

  • JSON-LD - JSON for Linking Data
  • RDF/XML - RDF in XML format
  • Turtle - Terse RDF Triple Language
  • N-Triples - Line-based RDF format
  • N-Quads - N-Triples with context

Feed Formats

  • Atom - Atom Syndication Format
  • RSS - Rich Site Summary feed format

Network Formats

  • PCAP - Packet Capture format
  • PCAPNG - PCAP Next Generation format

Log & Event Formats

  • Apache Log - Apache access/error logs
  • CEF - Common Event Format
  • GELF - Graylog Extended Log Format
  • WARC - Web ARChive format
  • CDX - Web archive index format
  • ILP - InfluxDB Line Protocol

Email Formats

  • EML - Email message format
  • MBOX - Mailbox format
  • MHTML - MIME HTML format

Configuration Formats

  • INI - INI configuration files
  • TOML - Tom's Obvious, Minimal Language
  • YAML - YAML Ain't Markup Language
  • HOCON - Human-Optimized Config Object Notation
  • EDN - Extensible Data Notation

Office Formats

  • XLS/XLSX - Microsoft Excel files
  • ODS - OpenDocument Spreadsheet

CAD Formats

  • DXF - AutoCAD Drawing Exchange Format

Streaming & Big Data Formats

  • Kafka - Apache Kafka format
  • Pulsar - Apache Pulsar format
  • Flink - Apache Flink format
  • Beam - Apache Beam format
  • RecordIO - RecordIO format
  • SequenceFile - Hadoop SequenceFile
  • TFRecord - TensorFlow Record format

Protocol & Serialization Formats

  • Protocol Buffers - Google Protocol Buffers
  • Cap'n Proto - Cap'n Proto serialization
  • FlatBuffers - FlatBuffers serialization
  • FlexBuffers - FlexBuffers format
  • Thrift - Apache Thrift format
  • ASN.1 - ASN.1 encoding format
  • Ion - Amazon Ion format

Other Formats

  • VCF - Variant Call Format (genomics)
  • iCal - iCalendar format
  • LDIF - LDAP Data Interchange Format
  • TXT - Plain text files

Supported Compression Codecs

  • GZip (.gz)
  • BZip2 (.bz2)
  • LZMA (.xz, .lzma)
  • LZ4 (.lz4)
  • ZIP (.zip)
  • Brotli (.br)
  • ZStandard (.zst, .zstd)
  • Snappy (.snappy, .sz)
  • LZO (.lzo, .lzop)
  • SZIP (.sz)
  • 7z (.7z)
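What "transparent compression" means in practice can be sketched with the standard library: the record reader wraps a decompressing stream, so the CSV layer never sees gzip bytes. open_iterable selects the codec from the file extension automatically; this sketch hardcodes gzip to show the underlying idea.

```python
import csv
import gzip

def iter_gzipped_csv(path):
    """Yield dict rows from a gzip-compressed CSV, decompressing on the fly."""
    # gzip.open in text mode decompresses incrementally, row by row.
    with gzip.open(path, "rt", encoding="utf-8", newline="") as stream:
        yield from csv.DictReader(stream)
```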

Requirements

Python 3.10+

Installation

pip install iterabledata

Or install from source:

git clone https://github.com/datenoio/iterabledata.git
cd iterabledata
pip install .

Optional Dependencies

IterableData supports optional extras for additional features:

# AI-powered documentation generation
pip install iterabledata[ai]

# Database ingestion (PostgreSQL, ClickHouse, MongoDB, MySQL, Elasticsearch, etc.)
pip install iterabledata[db]

# All optional dependencies
pip install iterabledata[all]

AI Features ([ai]): Enables AI-powered documentation generation using OpenAI, OpenRouter, Ollama, LMStudio, or Perplexity.

Database Engines ([db]): Enables read-only database access as iterable data sources. PostgreSQL and ClickHouse are available now; MySQL/MariaDB, Microsoft SQL Server, SQLite, MongoDB, and Elasticsearch/OpenSearch are planned. Convenience groups are included:

  • [db-sql]: SQL databases only (PostgreSQL, ClickHouse, MySQL, MSSQL)
  • [db-nosql]: NoSQL databases only (MongoDB, Elasticsearch)

See the API documentation for details on these features.

Quick Start

Basic Reading

from iterable.helpers.detect import open_iterable

# Automatically detects format and compression
# Using context manager (recommended)
with open_iterable('data.csv.gz') as source:
    for row in source:
        print(row)
        # Process your data here
# File is automatically closed

# Or manually (still supported)
source = open_iterable('data.csv.gz')
for row in source:
    print(row)
source.close()

Writing Data

from iterable.helpers.detect import open_iterable

# Write compressed JSONL file
# Using context manager (recommended)
with open_iterable('output.jsonl.zst', mode='w') as dest:
    for item in my_data:
        dest.write(item)
# File is automatically closed

# Or manually (still supported)
dest = open_iterable('output.jsonl.zst', mode='w')
for item in my_data:
    dest.write(item)
dest.close()
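The "Atomic Writes" feature listed earlier can be sketched with the standard library: write to a temporary file in the destination's directory, then rename it over the target with os.replace(), so readers never observe a half-written file. (This is a sketch of the general pattern, not the library's internal implementation.)

```python
import json
import os
import tempfile

def atomic_write_jsonl(path, records):
    """Write records as JSON Lines, atomically replacing the target file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            for record in records:
                tmp.write(json.dumps(record) + "\n")
        os.replace(tmp_path, path)  # atomic on POSIX and Windows
    except BaseException:
        os.remove(tmp_path)         # clean up the partial temp file
        raise
```

Keeping the temporary file in the same directory matters: os.replace() is only atomic within a single filesystem.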

Usage Examples

Reading Compressed CSV Files

from iterable.helpers.detect import open_iterable

# Read compressed CSV file (supports .gz, .bz2, .xz, .zst, .lz4, .br, .snappy, .lzo)
source = open_iterable('data.csv.xz')
n = 0
for row in source:
    n += 1
    # Process row data
    if n % 1000 == 0:
        print(f'Processed {n} rows')
source.close()

Reading Different Formats

from iterable.helpers.detect import open_iterable

# Read JSONL file
jsonl_file = open_iterable('data.jsonl')
for row in jsonl_file:
    print(row)
jsonl_file.close()

# Read Parquet file
parquet_file = open_iterable('data.parquet')
for row in parquet_file:
    print(row)
parquet_file.close()

# Read XML file (specify tag name)
xml_file = open_iterable('data.xml', iterableargs={'tagname': 'item'})
for row in xml_file:
    print(row)
xml_file.close()

# Read Excel file
xlsx_file = open_iterable('data.xlsx')
for row in xlsx_file:
    print(row)
xlsx_file.close()

Reading from Databases

from iterable.helpers.detect import open_iterable

# Read from a PostgreSQL database (requires the [db] extra).
# NOTE: the connection string below is illustrative; consult the API
# documentation for the exact connection syntax.
with open_iterable('postgresql://user:password@localhost:5432/mydb') as source:
    for row in source:
        print(row)
