
Exstruct

Conversion from Excel to structured JSON (tables, shapes, charts) for LLM/RAG pipelines, and autonomous Excel reading/writing by AI agents via CLI and MCP integration.


<p align="center"> <a href="https://harumiweb.github.io/exstruct/"> <img width="600" alt="ExStruct Logo" src="https://github.com/user-attachments/assets/c1d4e616-890f-435c-9d53-fba054f861a8" /> </a> </p> <p align="center"> <em>Excel Structured Extraction Engine</em> </p> <div align="center" style="max-width: 600px; margin: auto;">

Licence: BSD-3-Clause

</div> <p align="center"> <a href="README.md"> English </a> | <a href="README.ja.md"> 日本語 </a> </p>

ExStruct — Excel Structured Extraction Engine

ExStruct reads Excel workbooks into structured data and applies patch-based editing workflows through a shared core. It provides extraction APIs, a JSON-first editing CLI, and an MCP server for host-managed integrations, with options tuned for LLM/RAG preprocessing, reviewable edit flows, and local automation.

  • In COM/Excel environments (Windows), it performs rich extraction.
  • In non-COM environments (Linux/macOS):
    • if the LibreOffice runtime is available, it performs best-effort extraction for cells, table candidates, shapes, connectors, and charts
    • otherwise, it safely falls back to cells + table candidates + print areas

Detection heuristics, editing workflows, and output modes are adjustable for LLM/RAG pipelines and local automation.
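
The fallback order above can be written down as a small decision function. This is only an illustrative sketch, not ExStruct's own detection code: `choose_mode` and `libreoffice_on_path` are hypothetical helpers, the mode names are the ones listed under Output modes in this README, and the assumption that LibreOffice exposes a `soffice` or `libreoffice` launcher on PATH may not hold on every system.

```python
import shutil
import sys


def choose_mode(platform: str, has_excel_com: bool, has_libreoffice: bool) -> str:
    """Pick a mode name following the fallback order described above (hypothetical helper)."""
    if platform == "win32" and has_excel_com:
        return "standard"      # rich extraction through Excel COM
    if has_libreoffice:
        return "libreoffice"   # best-effort shapes, connectors, and charts
    return "light"             # cells + table candidates + print areas


def libreoffice_on_path() -> bool:
    # Assumption: a `soffice` or `libreoffice` launcher is discoverable on PATH.
    return any(shutil.which(name) for name in ("soffice", "libreoffice"))


if __name__ == "__main__":
    # Excel COM availability is passed in explicitly; detecting it is out of scope here.
    print(choose_mode(sys.platform, has_excel_com=False, has_libreoffice=libreoffice_on_path()))
```

ExStruct performs its own fallback at runtime, so a helper like this is mainly useful when a pipeline wants to log or pin the mode up front.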

Choose an Interface

| Use case | Recommended interface | Why |
| --- | --- | --- |
| Write direct Python Excel-editing code | openpyxl / xlwings | Usually the better fit for imperative Python editing. Reach for exstruct.edit only when you specifically want ExStruct's patch contract in Python. |
| Run local operator or AI-agent edit workflows | exstruct patch, make, ops, validate | Canonical operational interface; JSON-first and dry-run friendly. |
| Run sandboxed or host-managed integrations | exstruct-mcp / MCP tools | Integration / compatibility layer that owns PathPolicy, transport, and artifact behavior. |

Extraction keeps the existing top-level Python API (extract, process_excel, ExStructEngine) and the legacy exstruct INPUT.xlsx ... CLI entrypoint.

Main Features

  • Excel -> structured JSON: outputs cells, shapes, charts, SmartArt, table candidates, merged-cell ranges, print areas, and auto page-break areas by sheet or by area.
  • Output modes: light (cells + table candidates + print areas only), libreoffice (best-effort non-COM mode for .xlsx/.xlsm; adds merged cells, shapes, connectors, and charts when the LibreOffice runtime is available), standard (Excel COM mode with texted shapes + arrows, charts, SmartArt, and merged-cell ranges), verbose (all shapes with width/height plus cell hyperlinks).
  • Formula extraction: emits formulas_map (formula string -> cell coordinates) via openpyxl/COM. It is enabled by default in verbose and can be controlled with include_formulas_map.
  • Formats: JSON (compact by default; pass --pretty for indented output), YAML, and TOON. YAML and TOON require optional dependencies.
  • Backend metadata is opt-in: shape/chart provenance, approximation_level, and confidence are omitted from serialized output by default. Enable them with --include-backend-metadata or include_backend_metadata=True.
  • Workbook editing interfaces: use the editing CLI for primary ExStruct edit flows, keep MCP for host-owned safety controls, and use exstruct.edit only when you need the same patch contract from Python.
  • Table detection tuning: heuristics can be adjusted dynamically through the API.
  • Hyperlink extraction: in verbose mode, or with include_cell_links=True, cell links are emitted in links.
  • CLI rendering: in standard / verbose, PDF and sheet images can be generated when Excel COM is available.
  • Safe fallback: if Excel COM or the LibreOffice runtime is unavailable, the process does not crash and falls back to cells + table candidates + print areas.
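
If you already hold output that was produced with backend metadata enabled and want the compact default form for an LLM prompt, the three fields named above can be dropped in post-processing. The sketch below is an assumption-heavy illustration: `strip_backend_metadata` is a hypothetical helper, and the sample record (including its values) is made up, since this README does not show where the fields sit in real output.

```python
from typing import Any

# Field names taken from the feature list above.
BACKEND_METADATA_KEYS = frozenset({"provenance", "approximation_level", "confidence"})


def strip_backend_metadata(node: Any) -> Any:
    """Return a copy of `node` with backend metadata keys removed at every depth."""
    if isinstance(node, dict):
        return {
            key: strip_backend_metadata(value)
            for key, value in node.items()
            if key not in BACKEND_METADATA_KEYS
        }
    if isinstance(node, list):
        return [strip_backend_metadata(item) for item in node]
    return node


# Hypothetical shape record; real ExStruct output may be laid out differently.
shape = {
    "text": "Start",
    "provenance": "placeholder-backend",
    "confidence": 0.8,
    "children": [{"text": "Next", "approximation_level": "placeholder-level"}],
}
print(strip_backend_metadata(shape))
```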

Installation

pip install exstruct

Optional extras:

  • YAML: pip install pyyaml
  • TOON: pip install python-toon
  • Rendering (PDF/PNG): requires Excel plus pip install pypdfium2 pillow (not supported in mode=libreoffice)
  • Install everything at once: pip install exstruct[yaml,toon,render]
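
To check which optional extras are importable before choosing an output format, a quick standard-library probe can help. This is a sketch, not part of ExStruct: `has_module` and `optional_feature_status` are hypothetical names. The import names `yaml`, `pypdfium2`, and `PIL` correspond to the pyyaml, pypdfium2, and pillow packages; the import name for python-toon is not stated in this README, so it is left out.

```python
from importlib.util import find_spec
from typing import Dict


def has_module(import_name: str) -> bool:
    """True when the top-level module `import_name` can be found in this environment."""
    return find_spec(import_name) is not None


def optional_feature_status() -> Dict[str, bool]:
    """Report which optional extras look installed (Python packages only)."""
    return {
        "yaml": has_module("yaml"),  # pip install pyyaml
        # pip install pypdfium2 pillow; rendering additionally needs Excel itself.
        "render": has_module("pypdfium2") and has_module("PIL"),
    }


if __name__ == "__main__":
    print(optional_feature_status())
```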

Platform note:

  • Full COM extraction for shapes/charts targets Windows + Excel (xlwings/COM). On Linux/macOS/server environments, use mode=libreoffice as the best-effort rich mode or mode=light for minimal extraction. .xls is not supported in mode=libreoffice.
  • On Debian/Ubuntu/WSL, install LibreOffice together with python3-uno. ExStruct probes a compatible system Python automatically for mode=libreoffice; if your environment needs an explicit interpreter, set EXSTRUCT_LIBREOFFICE_PYTHON_PATH=/usr/bin/python3.
  • LibreOffice Python detection now runs the bundled bridge in --probe mode before selection. An incompatible EXSTRUCT_LIBREOFFICE_PYTHON_PATH fails fast instead of surfacing a delayed bridge SyntaxError during extraction.
  • If the isolated temporary LibreOffice profile fails before the UNO socket becomes ready, ExStruct retries once with the shared/default LibreOffice profile as a compatibility fallback and reports per-attempt startup detail if both launches fail.
  • GitHub Actions includes dedicated LibreOffice smoke jobs on ubuntu-24.04 and windows-2025. The Linux job installs libreoffice + python3-uno; the Windows job installs libreoffice-fresh and sets EXSTRUCT_LIBREOFFICE_PATH. Both jobs run tests/core/test_libreoffice_smoke.py with RUN_LIBREOFFICE_SMOKE=1.
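
Before setting EXSTRUCT_LIBREOFFICE_PYTHON_PATH, it can be useful to confirm by hand that the interpreter you plan to point at can import `uno`. The sketch below is a manual check only and is not the bundled bridge probe that ExStruct runs; `can_import` is a hypothetical helper, and falling back to the current interpreter when the variable is unset is a choice made purely for illustration.

```python
import os
import subprocess
import sys


def can_import(python_path: str, module: str, timeout: float = 30.0) -> bool:
    """Return True if `python_path -c "import <module>"` exits with status 0."""
    try:
        completed = subprocess.run(
            [python_path, "-c", f"import {module}"],
            capture_output=True,
            timeout=timeout,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Missing or non-executable interpreter, or a hung import.
        return False
    return completed.returncode == 0


if __name__ == "__main__":
    candidate = os.environ.get("EXSTRUCT_LIBREOFFICE_PYTHON_PATH", sys.executable)
    print(candidate, "can import uno:", can_import(candidate, "uno"))
```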

Quick Start CLI

exstruct input.xlsx > output.json          # compact JSON to stdout by default
exstruct input.xlsx -o out.json --pretty   # write pretty JSON to a file
exstruct input.xlsx --format yaml          # YAML (requires pyyaml)
exstruct input.xlsx --format toon          # TOON (requires python-toon)
exstruct input.xlsx --sheets-dir sheets/   # write one file per sheet
exstruct input.xlsx --auto-page-breaks-dir auto_areas/  # flag is always exposed; running it requires standard/verbose + Excel COM
exstruct input.xlsx --alpha-col            # output column keys as A, B, ..., AA
exstruct input.xlsx --include-backend-metadata  # include shape/chart backend metadata
exstruct input.xlsx --mode light           # cells + table candidates + print areas
exstruct input.xlsx --mode libreoffice     # best-effort extraction of shapes/connectors/charts without COM
exstruct input.xlsx --pdf --image          # PDF and PNGs (Excel COM required)
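
When these commands are assembled from Python (for example by an agent or a batch script), the flag combinations that this section says the CLI rejects can be caught before spawning a process. The following is a minimal sketch under that assumption: `build_extract_command` is a hypothetical helper, it covers only a few of the flags shown above, and it encodes only the rejection rules stated in this README.

```python
from typing import List, Optional


def build_extract_command(
    input_path: str,
    mode: Optional[str] = None,
    pdf: bool = False,
    image: bool = False,
    auto_page_breaks_dir: Optional[str] = None,
) -> List[str]:
    """Build an `exstruct` argv list, refusing combinations the README says are rejected."""
    if mode == "libreoffice" and (pdf or image or auto_page_breaks_dir):
        raise ValueError(
            "mode=libreoffice does not support --pdf, --image, or --auto-page-breaks-dir"
        )
    if mode == "light" and auto_page_breaks_dir:
        raise ValueError("mode=light does not support --auto-page-breaks-dir")

    argv = ["exstruct", input_path]
    if mode is not None:
        argv += ["--mode", mode]
    if pdf:
        argv.append("--pdf")
    if image:
        argv.append("--image")
    if auto_page_breaks_dir:
        argv += ["--auto-page-breaks-dir", auto_page_breaks_dir]
    return argv


print(build_extract_command("input.xlsx", mode="light"))
```

The resulting list can be passed to `subprocess.run(argv, capture_output=True, text=True)`; the CLI still performs its own validation at execution time.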

Auto page-break export is available from both the API and the CLI when Excel/COM is available. The CLI always exposes --auto-page-breaks-dir, but validates it at execution time. mode=libreoffice rejects --pdf, --image, and --auto-page-breaks-dir early, and mode=light also rejects --auto-page-breaks-dir. Use standard or verbose with Excel COM for those features.

By default, the CLI keeps legacy 0-based numeric string column keys ("0", "1", ...). Use --alpha-col when you need Excel-style keys ("A", "B", ...).

By default, serialized shape/chart output omits backend metadata (provenance, approximation_level, confidence) to reduce token usage. Use --include-backend-metadata or the corresponding Python/MCP option when you need it.

Note: MCP exstruct_extract defaults to options.alpha_col=true, which differs from the CLI default (false).
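
Because the CLI and MCP defaults differ, downstream code may receive either key style. The sketch below shows how the two styles relate by converting a 0-based numeric string key to an Excel-style column name; `to_alpha_col` is a hypothetical helper written for illustration, not ExStruct's implementation, and the sample row is made up.

```python
def to_alpha_col(zero_based_key: str) -> str:
    """Convert a 0-based numeric string key ("0", "1", ...) to an Excel-style column name."""
    index = int(zero_based_key)
    if index < 0:
        raise ValueError("column index must be non-negative")
    letters = ""
    n = index + 1  # Excel column names behave like a 1-based bijective base-26 number
    while n > 0:
        n, remainder = divmod(n - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters


# Hypothetical row keyed the legacy way, re-keyed to Excel-style columns.
row = {"0": "id", "1": "name", "26": "note"}
print({to_alpha_col(key): value for key, value in row.items()})
```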

Quick Start Editing CLI

exstruct patch --input book.xlsx --ops ops.json --backend openpyxl
exstruct patch --input book.xlsx --ops - --dry-run --pretty < ops.json
exstruct make --output new.xlsx --ops ops.json --backend openpyxl
exstruct ops list
exstruct ops describe create_chart --pretty
exstruct validate --input book.xlsx --pretty

  • patch and make print a JSON PatchResult to stdout.
  • This is the canonical operational / agent interface for workbook editing.
  • ops list / ops describe expose the public patch-op schema.
  • validate reports workbook readability (is_readable, warnings, errors).
  • Phase 2 keeps the legacy extraction CLI unchanged; it does not add exstruct extract or interactive safety flags yet.
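
A script that calls exstruct validate will usually want to reduce the JSON to a go/no-go decision. The sketch below assumes that `warnings` and `errors` are lists and that `is_readable` is a boolean; only the three field names come from this README, and both `summarize_validation` and the sample payload are hypothetical.

```python
import json
from typing import Any, Dict


def summarize_validation(payload: Dict[str, Any]) -> str:
    """Turn a validate-style payload into a one-line status string."""
    errors = payload.get("errors") or []
    warnings = payload.get("warnings") or []
    if not payload.get("is_readable", False) or errors:
        return f"unreadable: {len(errors)} error(s)"
    if warnings:
        return f"readable with {len(warnings)} warning(s)"
    return "readable"


# Hand-written sample; the real payload may carry more fields or other value types.
sample = json.loads('{"is_readable": true, "warnings": ["example warning"], "errors": []}')
print(summarize_validation(sample))
```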

Recommended edit flow:

  1. Build patch ops.
  2. Run exstruct patch --dry-run and inspect PatchResult, warnings, and diff.
  3. Pin --backend openpyxl when you want the dry run and the real apply to use the same engine.
  4. If you keep --backend auto, inspect PatchResult.engine; on Windows/Excel hosts the real apply may switch to COM.
  5. Re-run without --dry-run only after the result is acceptable.
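
The flow above can be scripted as a dry run followed by a gated apply. This sketch uses only the flags shown in this README; `patch_command` and `dry_run_is_acceptable` are hypothetical helpers, and the assumption that PatchResult serializes `warnings` as a list and `engine` as a string is not confirmed here, so inspect a real dry-run result before relying on it.

```python
from typing import Any, Dict, List


def patch_command(
    input_path: str, ops_path: str, dry_run: bool, backend: str = "openpyxl"
) -> List[str]:
    """Build the argv for `exstruct patch`, pinning the backend as in step 3."""
    argv = ["exstruct", "patch", "--input", input_path, "--ops", ops_path, "--backend", backend]
    if dry_run:
        argv.append("--dry-run")
    return argv


def dry_run_is_acceptable(result: Dict[str, Any], expected_engine: str = "openpyxl") -> bool:
    """Gate for step 5: no warnings, and the engine matches the one that was pinned."""
    # Assumption: `warnings` is a list and `engine` is a string in the serialized PatchResult.
    if result.get("warnings"):
        return False
    return result.get("engine") == expected_engine


print(patch_command("book.xlsx", "ops.json", dry_run=True))
```

In practice you would run the dry-run argv with `subprocess.run(argv, capture_output=True, text=True)`, parse stdout with `json.loads`, review the diff yourself, and only then run the same command built with `dry_run=False`.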

ExStruct CLI Skill

ExStruct also ships one repo-owned Skill for agents that should follow the editing CLI safely instead of rediscovering the workflow each time.

Canonical repo source:

  • .agents/skills/exstruct-cli/

You can install it with the following single command:

npx skills add harumiWeb/exstruct/.agents/skills --skill exstruct-cli

That command should install exstruct-cli directly from this repository's
