Iterable Data

Iterable Data is a Python library for reading and writing data files row by row through a consistent, iterator-based interface. It provides a unified API for working with many data formats (CSV, JSON, Parquet, XML, etc.), similar to csv.DictReader but supporting far more formats.

The library simplifies data processing and conversion between formats while preserving complex nested data structures (unlike pandas DataFrames, which require flattening).
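The nested-data point above can be illustrated with the standard library alone: csv.DictReader yields flat, string-valued rows, while a JSON Lines reader preserves nested structures as Python dictionaries. (This sketch shows the behaviour the library generalizes; it does not use iterabledata itself.)

```python
import csv
import io
import json

csv_text = "id,tags\n1,\"a;b\"\n"
jsonl_text = '{"id": 1, "tags": ["a", "b"], "meta": {"lang": "en"}}\n'

# CSV: every value is a flat string; nesting must be encoded by hand.
csv_row = next(csv.DictReader(io.StringIO(csv_text)))
print(csv_row["tags"])            # "a;b" -- just a string

# JSON Lines: nested lists and dicts survive the round trip.
jsonl_row = json.loads(jsonl_text.splitlines()[0])
print(jsonl_row["meta"]["lang"])  # "en" -- a real nested dict
```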

Features

  • Unified API: Single interface for reading/writing multiple data formats
  • Automatic Format Detection: Detects file type and compression from filename or content (magic numbers and heuristics)
  • Format Capability Reporting: Programmatically query format capabilities (read/write/bulk/totals/streaming/tables)
  • Support for Compression: Works seamlessly with compressed files
  • Preserves Nested Data: Handles complex nested structures as Python dictionaries
  • DuckDB Integration: Optional DuckDB engine for high-performance queries with pushdown optimizations
  • Pipeline Processing: Built-in pipeline support for data transformation
  • Encoding Detection: Automatic encoding and delimiter detection for text files
  • Bulk Operations: Efficient batch reading and writing
  • Table Listing: Discover available tables, sheets, and datasets in multi-table formats
  • Context Manager Support: Use with statements for automatic resource cleanup
  • DataFrame Bridges: Convert iterable data to Pandas, Polars, and Dask DataFrames with one-liner methods
  • Cloud Storage Support: Direct access to S3, GCS, and Azure Blob Storage via URI schemes
  • Database Engine Support: Read-only access to SQL and NoSQL databases (PostgreSQL, ClickHouse, MySQL, MongoDB, Elasticsearch, etc.) as iterable data sources
  • Atomic Writes: Production-safe file writing with temporary files and atomic renames
  • Bulk File Conversion: Convert multiple files at once using glob patterns or directories
  • Progress Tracking and Metrics: Built-in progress bars, callbacks, and structured metrics objects
  • Error Handling Controls: Configurable error policies and structured error logging
  • Type Hints and Type Safety: Complete type annotations with typed helper functions for dataclasses and Pydantic models
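The row-by-row conversion and pipeline model behind several of the features above can be sketched with the standard library: read records one at a time, optionally transform each, and write them out in another format, without loading the whole file into memory. (convert_csv_to_ndjson is a hypothetical helper for illustration, not iterabledata API.)

```python
import csv
import json

def convert_csv_to_ndjson(src_path, dest_path, transform=None):
    """Stream a CSV file into NDJSON, one record at a time."""
    with open(src_path, newline="", encoding="utf-8") as src, \
         open(dest_path, "w", encoding="utf-8") as dest:
        for row in csv.DictReader(src):   # one record at a time
            if transform is not None:
                row = transform(row)      # optional pipeline step
            dest.write(json.dumps(row) + "\n")
```

Because only one row is in memory at a time, the same pattern scales from kilobytes to files larger than RAM.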

Supported File Types

Core Formats

  • JSON - Standard JSON files
  • JSONL/NDJSON - JSON Lines format (one JSON object per line)
  • JSON-LD - JSON for Linking Data (RDF format)
  • CSV/TSV - Comma and tab-separated values
  • Annotated CSV - CSV with type annotations and metadata
  • CSVW - CSV on the Web (with metadata)
  • PSV/SSV - Pipe and semicolon-separated values
  • LTSV - Labeled Tab-Separated Values
  • FWF - Fixed Width Format
  • XML - XML files with configurable tag parsing
  • ZIP XML - XML files within ZIP archives
  • HTML - HTML files with table extraction

Binary Formats

  • BSON - Binary JSON format
  • MessagePack - Efficient binary serialization
  • CBOR - Concise Binary Object Representation
  • UBJSON - Universal Binary JSON
  • SMILE - Binary JSON variant
  • Bencode - BitTorrent encoding format
  • Avro - Apache Avro binary format
  • Pickle - Python pickle format

Columnar & Analytics Formats

  • Parquet - Apache Parquet columnar format
  • ORC - Optimized Row Columnar format
  • Arrow/Feather - Apache Arrow columnar format
  • Lance - Modern columnar format optimized for ML and vector search
  • Vortex - Modern columnar format with fast random access
  • Delta Lake - Delta Lake format
  • Iceberg - Apache Iceberg format
  • Hudi - Apache Hudi format

Database Formats

  • SQLite - SQLite database files
  • DBF - dBase/FoxPro database files
  • MySQL Dump - MySQL dump files
  • PostgreSQL Copy - PostgreSQL COPY format
  • DuckDB - DuckDB database files

Statistical Formats

  • SAS - SAS data files
  • Stata - Stata data files
  • SPSS - SPSS data files
  • R Data - R RDS and RData files
  • PX - PC-Axis format
  • ARFF - Attribute-Relation File Format (Weka format)

Scientific Formats

  • NetCDF - Network Common Data Form for scientific data
  • HDF5 - Hierarchical Data Format

Geospatial Formats

  • GeoJSON - Geographic JSON format
  • GeoPackage - OGC GeoPackage format
  • GML - Geography Markup Language
  • KML - Keyhole Markup Language
  • Shapefile - ESRI Shapefile format
  • MVT/PBF - Mapbox Vector Tiles
  • TopoJSON - Topology-preserving GeoJSON extension

RDF & Semantic Formats

  • JSON-LD - JSON for Linking Data
  • RDF/XML - RDF in XML format
  • Turtle - Terse RDF Triple Language
  • N-Triples - Line-based RDF format
  • N-Quads - N-Triples with context

Feed Formats

  • Atom - Atom Syndication Format
  • RSS - Rich Site Summary feed format

Network Formats

  • PCAP - Packet Capture format
  • PCAPNG - PCAP Next Generation format

Log & Event Formats

  • Apache Log - Apache access/error logs
  • CEF - Common Event Format
  • GELF - Graylog Extended Log Format
  • WARC - Web ARChive format
  • CDX - Web archive index format
  • ILP - InfluxDB Line Protocol

Email Formats

  • EML - Email message format
  • MBOX - Mailbox format
  • MHTML - MIME HTML format

Configuration Formats

  • INI - INI configuration files
  • TOML - Tom's Obvious, Minimal Language
  • YAML - YAML Ain't Markup Language
  • HOCON - Human-Optimized Config Object Notation
  • EDN - Extensible Data Notation

Office Formats

  • XLS/XLSX - Microsoft Excel files
  • ODS - OpenDocument Spreadsheet

CAD Formats

  • DXF - AutoCAD Drawing Exchange Format

Streaming & Big Data Formats

  • Kafka - Apache Kafka format
  • Pulsar - Apache Pulsar format
  • Flink - Apache Flink format
  • Beam - Apache Beam format
  • RecordIO - RecordIO format
  • SequenceFile - Hadoop SequenceFile
  • TFRecord - TensorFlow Record format

Protocol & Serialization Formats

  • Protocol Buffers - Google Protocol Buffers
  • Cap'n Proto - Cap'n Proto serialization
  • FlatBuffers - FlatBuffers serialization
  • FlexBuffers - FlexBuffers format
  • Thrift - Apache Thrift format
  • ASN.1 - ASN.1 encoding format
  • Ion - Amazon Ion format

Other Formats

  • VCF - Variant Call Format (genomics)
  • iCal - iCalendar format
  • LDIF - LDAP Data Interchange Format
  • TXT - Plain text files

Supported Compression Codecs

  • GZip (.gz)
  • BZip2 (.bz2)
  • LZMA (.xz, .lzma)
  • LZ4 (.lz4)
  • ZIP (.zip)
  • Brotli (.br)
  • ZStandard (.zst, .zstd)
  • Snappy (.snappy, .sz)
  • LZO (.lzo, .lzop)
  • SZIP (.sz)
  • 7z (.7z)
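What "transparent compression" means in practice can be sketched with the standard library: the record reader wraps a decompressing stream, so the CSV layer never sees gzip bytes. open_iterable selects the codec from the file extension automatically; this sketch hardcodes gzip to show the underlying idea.

```python
import csv
import gzip

def iter_gzipped_csv(path):
    """Yield dict rows from a gzip-compressed CSV, decompressing on the fly."""
    # gzip.open in text mode decompresses incrementally, row by row.
    with gzip.open(path, "rt", encoding="utf-8", newline="") as stream:
        yield from csv.DictReader(stream)
```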

Requirements

Python 3.10+

Installation

pip install iterabledata

Or install from source:

git clone https://github.com/datenoio/iterabledata.git
cd iterabledata
pip install .

Optional Dependencies

IterableData supports optional extras for additional features:

# AI-powered documentation generation
pip install iterabledata[ai]

# Database ingestion (PostgreSQL, ClickHouse, MongoDB, MySQL, Elasticsearch, etc.)
pip install iterabledata[db]

# All optional dependencies
pip install iterabledata[all]

AI Features ([ai]): Enables AI-powered documentation generation using OpenAI, OpenRouter, Ollama, LMStudio, or Perplexity.

Database Engines ([db]): Enables read-only database access as iterable data sources. PostgreSQL and ClickHouse are available now; MySQL/MariaDB, Microsoft SQL Server, SQLite, MongoDB, and Elasticsearch/OpenSearch are planned. Convenience groups are included:

  • [db-sql]: SQL databases only (PostgreSQL, ClickHouse, MySQL, MSSQL)
  • [db-nosql]: NoSQL databases only (MongoDB, Elasticsearch)

See the API documentation for details on these features.

Quick Start

Basic Reading

from iterable.helpers.detect import open_iterable

# Automatically detects format and compression
# Using context manager (recommended)
with open_iterable('data.csv.gz') as source:
    for row in source:
        print(row)
        # Process your data here
# File is automatically closed

# Or manually (still supported)
source = open_iterable('data.csv.gz')
for row in source:
    print(row)
source.close()

Writing Data

from iterable.helpers.detect import open_iterable

# Write compressed JSONL file
# Using context manager (recommended)
with open_iterable('output.jsonl.zst', mode='w') as dest:
    for item in my_data:
        dest.write(item)
# File is automatically closed

# Or manually (still supported)
dest = open_iterable('output.jsonl.zst', mode='w')
for item in my_data:
    dest.write(item)
dest.close()
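The "Atomic Writes" feature listed earlier can be sketched with the standard library: write to a temporary file in the destination's directory, then rename it over the target with os.replace(), so readers never observe a half-written file. (This is a sketch of the general pattern, not the library's internal implementation.)

```python
import json
import os
import tempfile

def atomic_write_jsonl(path, records):
    """Write records as JSON Lines, atomically replacing the target file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            for record in records:
                tmp.write(json.dumps(record) + "\n")
        os.replace(tmp_path, path)  # atomic on POSIX and Windows
    except BaseException:
        os.remove(tmp_path)         # clean up the partial temp file
        raise
```

Keeping the temporary file in the same directory matters: os.replace() is only atomic within a single filesystem.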

Usage Examples

Reading Compressed CSV Files

from iterable.helpers.detect import open_iterable

# Read compressed CSV file (supports .gz, .bz2, .xz, .zst, .lz4, .br, .snappy, .lzo)
source = open_iterable('data.csv.xz')
n = 0
for row in source:
    n += 1
    # Process row data
    if n % 1000 == 0:
        print(f'Processed {n} rows')
source.close()

Reading Different Formats

from iterable.helpers.detect import open_iterable

# Read JSONL file
jsonl_file = open_iterable('data.jsonl')
for row in jsonl_file:
    print(row)
jsonl_file.close()

# Read Parquet file
parquet_file = open_iterable('data.parquet')
for row in parquet_file:
    print(row)
parquet_file.close()

# Read XML file (specify tag name)
xml_file = open_iterable('data.xml', iterableargs={'tagname': 'item'})
for row in xml_file:
    print(row)
xml_file.close()

# Read Excel file
xlsx_file = open_iterable('data.xlsx')
for row in xlsx_file:
    print(row)
xlsx_file.close()

Reading from Databases

from iterable.helpers.detect import open_iterable

# Read from a PostgreSQL database (requires the [db] extra).
# NOTE: the connection string below is illustrative; consult the API
# documentation for the exact connection syntax.
with open_iterable('postgresql://user:password@localhost:5432/mydb') as source:
    for row in source:
        print(row)
