Oxmltotext
A lightweight and efficient text content extractor mainly for OOXML files (typically referring to docx/xlsx/pptx).
😡😡😡Dumping nuclear wastewater into the ocean, damn it! 💣🗾💥😤😤😤
🎯 About
Oxmltotext is a lightweight and efficient text content extractor mainly for OOXML files (typically referring to DOCX/XLSX/PPTX). Solutions are also available for PDF as well as DOC/XLS/PPT formats.
✨ Features
This repo provides the following functionalities:
- Extracting text content from DOCX/XLSX/PPTX format (files, readers or URL), with the option to extract text from charts/diagrams by configuring settings. It can also extract text from images within the files using the default tesseract or custom OCR interfaces.
- Extracting text content from PDF format (files, readers or URL) using go-fitz.
- Extracting text content from DOC format (files, readers or URL) using the antiword command-line tool.
- Extracting text content from XLS format (files, readers or URL) using the xlstotext program (compiled using Rust).
- Extracting text content from PPT format (files, readers or URL) using the Tika server (about Tika, see https://tika.apache.org/).
⚠️ Please note that this repo does not check whether an input file actually conforms to its expected format.
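Because the format is not checked for you, a caller may want a cheap sanity check before picking an extractor. The sketch below is not part of this repo; it only inspects well-known leading bytes (ZIP for OOXML packages, `%PDF-` for PDF, the OLE2 compound-file signature for legacy DOC/XLS/PPT). The function name `sniffContainer` is hypothetical, and a ZIP signature alone cannot tell DOCX, XLSX and PPTX apart.

```go
package main

import (
	"bytes"
	"fmt"
)

// sniffContainer reports the container family suggested by the first bytes
// of a file. It is a heuristic, not a validation of the full format.
func sniffContainer(head []byte) string {
	switch {
	case bytes.HasPrefix(head, []byte("PK\x03\x04")):
		return "zip" // DOCX/XLSX/PPTX are ZIP packages
	case bytes.HasPrefix(head, []byte("%PDF-")):
		return "pdf"
	case bytes.HasPrefix(head, []byte{0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1}):
		return "ole2" // legacy DOC/XLS/PPT compound files
	default:
		return "unknown"
	}
}

func main() {
	fmt.Println(sniffContainer([]byte("%PDF-1.7\n")))
	fmt.Println(sniffContainer([]byte("PK\x03\x04rest of archive")))
	fmt.Println(sniffContainer([]byte("plain text")))
}
```

In practice you would read the first eight bytes of the file or reader and pass them in before deciding which package to call.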
✅ Requirements
golang >=1.21.0
🛠 Installation
go get -u github.com/young2j/oxmltotext@latest
🚀 Quick Start
1. Extract text from docx/xlsx/pptx format
For these formats, the interfaces are consistent. Taking docx as an example:
plain text
package main

import (
	"fmt"

	"github.com/young2j/oxmltotext/docxtotext"
)

func main() {
	dp, err := docxtotext.Open("../filesamples/file-sample_100kb.docx")
	if err != nil {
		panic(err)
	}
	defer dp.Close() // Remember to call `Close` to avoid memory leaks.

	texts, err := dp.ExtractTexts()
	if err != nil {
		panic(err)
	}
	fmt.Println(texts)
}
Output looks like this:
...
-------------------------------------------------------------------------------------
Comment for demo.
-------------------------------------------------------------------------------------
Page Header ForDemo
-------------------------------------------------------------------------------------
Page Foot ForDemo
-------------------------------------------------------------------------------------
Footnote for demo.
-------------------------------------------------------------------------------------
Endnote for demo.
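For intuition about what an OOXML text extractor does, here is a minimal standard-library sketch. It is not this repo's implementation: it builds a tiny DOCX-like ZIP in memory, opens `word/document.xml`, and collects the text of every `w:t` element, ending each `w:p` paragraph with a newline. The helper names `buildSample` and `extractDocumentText` are hypothetical. A real document also keeps comments, headers, footers, footnotes and endnotes in separate parts of the package, which is why the output above lists them in separate sections.

```go
package main

import (
	"archive/zip"
	"bytes"
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

const wordNS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

// buildSample creates an in-memory ZIP holding only word/document.xml.
func buildSample() ([]byte, error) {
	const doc = `<?xml version="1.0" encoding="UTF-8"?>` +
		`<w:document xmlns:w="` + wordNS + `"><w:body>` +
		`<w:p><w:r><w:t>Hello</w:t></w:r><w:r><w:t xml:space="preserve"> world</w:t></w:r></w:p>` +
		`<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>` +
		`</w:body></w:document>`
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)
	w, err := zw.Create("word/document.xml")
	if err != nil {
		return nil, err
	}
	if _, err := io.WriteString(w, doc); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// extractDocumentText streams word/document.xml and keeps only w:t text.
func extractDocumentText(data []byte) (string, error) {
	zr, err := zip.NewReader(bytes.NewReader(data), int64(len(data)))
	if err != nil {
		return "", err
	}
	f, err := zr.Open("word/document.xml")
	if err != nil {
		return "", err
	}
	defer f.Close()

	var sb strings.Builder
	dec := xml.NewDecoder(f)
	inText := false
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			break
		}
		if err != nil {
			return "", err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			if t.Name.Space == wordNS && t.Name.Local == "t" {
				inText = true
			}
		case xml.EndElement:
			if t.Name.Space == wordNS {
				switch t.Name.Local {
				case "t":
					inText = false
				case "p":
					sb.WriteByte('\n') // paragraph boundary
				}
			}
		case xml.CharData:
			if inText {
				sb.Write(t)
			}
		}
	}
	return sb.String(), nil
}

func main() {
	data, err := buildSample()
	if err != nil {
		panic(err)
	}
	text, err := extractDocumentText(data)
	if err != nil {
		panic(err)
	}
	fmt.Print(text)
}
```

Streaming tokens instead of unmarshalling the whole document keeps memory use proportional to the extracted text rather than to the XML tree.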
charts and diagrams
Extract text from charts and diagrams:
func main() {
	dp, err := docxtotext.Open("../filesamples/file-sample_100kb.docx")
	if err != nil {
		panic(err)
	}
	defer dp.Close() // Remember to call `Close` to avoid memory leaks.

	dp.SetParseCharts(true)   // set to true to extract text from charts
	dp.SetParseDiagrams(true) // set to true to extract text from diagrams

	texts, err := dp.ExtractTexts()
	if err != nil {
		panic(err)
	}
	fmt.Println(texts)
}
Output looks like this:
...(other texts)
┌───────────────chart───────────────┐
[系列 1]
类别 1 类别 2 类别 3 类别 4
4.3 2.5 3.5 4.5
[系列 2]
类别 1 类别 2 类别 3 类别 4
2.4 4.4000000000000004 1.8 2.8
[系列 3]
类别 1 类别 2 类别 3 类别 4
2 2 3 5
└───────────────────────────────────┘
┌──diagram──┐
smartart 1
smartart 2
smartart3
└───────────┘
...(other texts)
Of course, you can also remove the formatting borders through API settings.
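If you already have framed output and only want the inner text, a post-processing pass is one alternative to changing the library settings. The sketch below is purely illustrative and does not use this repo's API: the hypothetical `stripFrames` drops every line whose first non-space character is one of the box corners `┌` or `└` seen in the sample output.

```go
package main

import (
	"fmt"
	"strings"
)

// stripFrames removes the top and bottom border lines of framed sections
// and keeps every other line unchanged.
func stripFrames(text string) string {
	var kept []string
	for _, line := range strings.Split(text, "\n") {
		trimmed := strings.TrimSpace(line)
		if strings.HasPrefix(trimmed, "┌") || strings.HasPrefix(trimmed, "└") {
			continue
		}
		kept = append(kept, line)
	}
	return strings.Join(kept, "\n")
}

func main() {
	framed := "before\n┌──diagram──┐\nsmartart 1\nsmartart 2\n└───────────┘\nafter"
	fmt.Println(stripFrames(framed))
}
```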
OCR
Extract text from images (OCR):
If no OCR interface is set, the default tesseract-ocr is used, so you need to install tesseract-ocr for your operating system first.
If you use apt as your package manager, you can run:
apt install -y --no-install-recommends libtesseract-dev # libs
apt install -y --no-install-recommends tesseract-ocr-eng tesseract-ocr-chi-sim tesseract-ocr-script-hans # language packages
If you use Homebrew on macOS, you can run:
brew install tesseract
brew install tesseract-lang # language packages
For more details, see tesseract.
func main() {
	dp, err := docxtotext.Open("../filesamples/file-sample_100kb.docx")
	if err != nil {
		panic(err)
	}
	defer dp.Close() // Remember to call `Close` to avoid memory leaks.

	dp.SetParseImages(true) // set to true to extract text from images

	texts, err := dp.ExtractTexts()
	if err != nil {
		panic(err)
	}
	fmt.Println(texts)
}
Output looks like this:
...(other texts)
┌──────────────────────image──────────────────────┐
姓名 韦小宝
性 别 男 民族 汉
出 生 1654 £12 2208
ff 址 北京 市 东城 区 景山 前 街 4 号
紫禁城 敬 事 房
公民 身份 证 号 码 11204416541220243x
└─────────────────────────────────────────────────┘
...(other texts)
2. Extract text from pdf format
package main

import (
	"fmt"

	"github.com/young2j/oxmltotext/pdftotext"
)

func main() {
	pp, err := pdftotext.Open("../filesamples/file-sample_500kb.pdf")
	if err != nil {
		panic(err)
	}
	defer pp.Close() // Remember to call `Close` to avoid memory leaks.

	// To extract only pages 1 and 2:
	// texts, err := pp.ExtractPageTexts(1, 2)
	texts, err := pp.ExtractTexts()
	if err != nil {
		panic(err)
	}
	fmt.Println(texts)
}
3. Extract text from doc format
To work with a doc file, you need to install Antiword:
apt install -y --no-install-recommends antiword
# or on macOS
brew install antiword
package main

import (
	"fmt"

	"github.com/young2j/oxmltotext/doctotext"
)

func main() {
	texts, err := doctotext.ExtractFromPath("../filesamples/file-sample_100kb.doc")
	if err != nil {
		panic(err)
	}
	fmt.Println(texts)
}
4. Extract text from xls format
To work with an xls file, first compile the xlstotext executable with Cargo, then make it available through your environment variables (for example, by adding its directory to PATH):
cd xlstotext/rs
cargo build --release
# executable program: xlstotext/rs/target/release/xlstotext
package main

import (
	"fmt"

	"github.com/young2j/oxmltotext/xlstotext"
)

func main() {
	texts, err := xlstotext.ExtractFromPath("../filesamples/file-sample_100kb.xls")
	if err != nil {
		panic(err)
	}
	fmt.Println(texts)
}
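The doc and xls paths depend on external programs (antiword and xlstotext), so it can help to fail early with a clear message when one is missing. The following sketch assumes both tools are expected on `PATH`, which this README implies but does not spell out; `missingTools` is a hypothetical helper, not part of this repo.

```go
package main

import (
	"fmt"
	"os/exec"
)

// missingTools returns the names that cannot be resolved on PATH.
func missingTools(names ...string) []string {
	var missing []string
	for _, name := range names {
		if _, err := exec.LookPath(name); err != nil {
			missing = append(missing, name)
		}
	}
	return missing
}

func main() {
	if missing := missingTools("antiword", "xlstotext"); len(missing) > 0 {
		fmt.Println("not found on PATH:", missing)
		return
	}
	fmt.Println("all external tools found")
}
```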
5. Extract text from ppt format
If Apache Tika is your only option for extracting text from ppt files, you need to run a Tika server. For testing, you can run the following commands to start the server on your machine:
# see tikaserver/local.sh
wget --no-check-certificate https://dlcdn.apache.org/tika/2.9.1/tika-server-standard-2.9.1.jar
java -jar tika-server-standard-2.9.1.jar
Tika server runs on the default port 9998.
package main

import (
	"fmt"

	"github.com/young2j/oxmltotext/ppttotext"
)

func main() {
	texts, statusCode, err := ppttotext.ExtractFromPathByTika("../filesamples/file-sample_500kb.ppt", "http://localhost:9998/tika")
	if err != nil {
		panic(err)
	}
	fmt.Printf("tika server response status code: %d\n", statusCode)
	fmt.Println(texts)
}
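To see what talking to a Tika server involves, here is a standard-library sketch of the documented Tika REST call: a `PUT` to the `/tika` endpoint with the file as the request body and `Accept: text/plain`. It is not necessarily how `ExtractFromPathByTika` is implemented. `extractViaTika` and `newFakeTika` are hypothetical names, and the fake server exists only so the example runs without Java; against a real server you would pass `http://localhost:9998/tika`.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// extractViaTika uploads doc with PUT and asks for a plain-text response.
// It returns the response body, the HTTP status code, and any error.
func extractViaTika(client *http.Client, endpoint string, doc []byte) (string, int, error) {
	req, err := http.NewRequest(http.MethodPut, endpoint, bytes.NewReader(doc))
	if err != nil {
		return "", 0, err
	}
	req.Header.Set("Accept", "text/plain")
	resp, err := client.Do(req)
	if err != nil {
		return "", 0, err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", resp.StatusCode, err
	}
	return string(body), resp.StatusCode, nil
}

// newFakeTika starts a local stand-in that only reports the upload size.
func newFakeTika() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPut {
			http.Error(w, "use PUT", http.StatusMethodNotAllowed)
			return
		}
		n, _ := io.Copy(io.Discard, r.Body)
		fmt.Fprintf(w, "received %d bytes", n)
	}))
}

func main() {
	srv := newFakeTika()
	defer srv.Close()

	text, status, err := extractViaTika(srv.Client(), srv.URL+"/tika", []byte("fake ppt bytes"))
	if err != nil {
		panic(err)
	}
	fmt.Println(status, text)
}
```

Returning the status code alongside the text mirrors the three-value result shown above, so callers can distinguish a transport error from a non-200 reply.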
🔨 Build Tags
OCR (Optical Character Recognition) for image text is not enabled by default: it requires additional dependencies, is not a frequent requirement, and can affect performance. This repo uses the Go build tag "ocr" for conditional compilation. To enable the default OCR interface, add the "ocr" tag when compiling your program; this is not needed if you provide a custom OCR implementation.
Your build command should contain the tag named "ocr":
go build -tags ocr .
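As a reminder of how Go build tags gate code, the sketch below shows the kind of file that is compiled only when the "ocr" tag is absent. It is a generic illustration, not this repo's actual layout: the file name, `recognize`, and `errOCRDisabled` are hypothetical. A companion file starting with `//go:build ocr` would define the same function with a real OCR call, and `go build -tags ocr .` would select that file instead.

```go
//go:build !ocr

// ocr_stub.go (hypothetical name): selected when building WITHOUT -tags ocr.
package main

import (
	"errors"
	"fmt"
)

var errOCRDisabled = errors.New("ocr support not compiled in; rebuild with -tags ocr")

// recognize is the stub variant: it never touches an OCR engine, so the
// default build carries no OCR dependency.
func recognize(image []byte) (string, error) {
	return "", errOCRDisabled
}

func main() {
	_, err := recognize(nil)
	fmt.Println(err)
}
```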
🔥 Benchmark
cd tikaserver
java -jar tika-server-standard-2.9.1.jar
cd ../benchmark
make bench_all
bench_docx
goos: darwin
goarch: amd64
pkg: gobench
cpu: Intel(R) Core(TM) i5-1038NG7 CPU @ 2.00GHz
Benchmark_ParseDocxByGooxml-8 31 36919814 ns/op 6993180 B/op 125551 allocs/op
Benchmark_ParseDocxByGodocx-8 87 15920631 ns/op 3748453 B/op 83037 allocs/op
Benchmark_ParseDocxByDocconv-8 100 15065089 ns/op 6537887 B/op 76339 allocs/op
Benchmark_ParseDocxByOxmlToText-8 322 3338957 ns/op 830322 B/op 18999 allocs/op
Benchmark_ParseDocxByTika-8 1 1407496211 ns/op 153448 B/op 270 allocs/op
PASS
ok gobench 8.718s
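To read the docx table as relative cost, you can divide each ns/op figure by the OxmlToText figure. The sketch below does only that arithmetic on the numbers printed above; `relativeCost` is a hypothetical helper. Note that the Tika row ran a single iteration and includes an HTTP round trip, so its ratio is a rough indication at best.

```go
package main

import "fmt"

// relativeCost returns how many times longer otherNs takes than baseNs.
func relativeCost(baseNs, otherNs float64) float64 {
	return otherNs / baseNs
}

func main() {
	const oxmlToText = 3338957.0 // ns/op, from Benchmark_ParseDocxByOxmlToText

	results := []struct {
		name string
		ns   float64
	}{
		{"Gooxml", 36919814},
		{"Godocx", 15920631},
		{"Docconv", 15065089},
		{"Tika", 1407496211},
	}
	for _, r := range results {
		fmt.Printf("%-8s %.1fx the time per op of OxmlToText\n", r.name, relativeCost(oxmlToText, r.ns))
	}
}
```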
bench_xlsx
go test -benchmem -bench ^Benchmark_ParseXlsx -benchtime=1s
goos: darwin
goarch: amd64
pkg: gobench
cpu: Intel(R) Core(TM) i5-1038NG7 CPU @ 2.00GHz
Benchmark_ParseXlsxByGooxml-8 102 11412378 ns/op 1875498 B/op 31877 allocs/op
Benchmark_ParseXlsxByTealeg-8 199 5918170 ns/op 1868823 B/op 34834 allocs/op
Benchmark_ParseXlsxByExcelize-8 151 7667531 ns/op 3001478 B/op 33366 allocs/op
Benchmark_ParseXl