TextAtAnyCost

PHP modules to read text content from various document formats: PDF, DOC, PPT, RTF (not a binary format), etc.

Extract plain text from common document formats — no external programs or PECL extensions required.

Supported formats

| Format | Extension | Notes |
|--------|-----------|-------|
| Microsoft Word 97–2003 | .doc | CFB/WCBFF, ANSI and Unicode |
| Microsoft PowerPoint 97–2003 | .ppt | CFB/WCBFF |
| Adobe PDF | .pdf | FlateDecode, ASCII-85, ASCII-Hex, ToUnicode CMaps |
| Rich Text Format | .rtf | Stack-based parser, Mac Roman + Windows-1251 |
| Word 2007+ (Open XML) | .docx | ZIP + XML |
| OpenDocument Text | .odt | ZIP + XML |
| RAR archives (read list) | .rar | RAR 4.x, no PECL required |
| RAR archives (write/store) | .rar | Store method only |

Requirements

  • PHP 8.3 or later
  • Extensions: mbstring, zlib, dom, zip (all standard in PHP 8)

Installation

Via Composer (recommended)

composer require rembish/text-at-any-cost

Directly from GitHub

composer require rembish/text-at-any-cost:dev-master

Packagist: tagged releases (composer require rembish/text-at-any-cost:^1.0) become available once the repository's GitHub URL has been submitted at packagist.org.

Usage

Unified facade (auto-detects by extension)

use TextAtAnyCost\TextExtractor;

$text = TextExtractor::fromFile('/path/to/document.docx');
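
As a rough illustration of what detection by extension involves, the sketch below maps a file name's extension to a parser label. The function `guessParserLabel()` and the labels it returns are hypothetical names chosen for this example. They are not part of the library, whose internal dispatch may differ.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of TextAtAnyCost): picks a parser label
 * from the file extension, case-insensitively.
 */
function guessParserLabel(string $path): string
{
    $extension = strtolower(pathinfo($path, PATHINFO_EXTENSION));

    return match ($extension) {
        'doc' => 'doc',
        'ppt' => 'ppt',
        'pdf' => 'pdf',
        'rtf' => 'rtf',
        'docx', 'odt' => 'zipped-xml',
        // Mirrors the documented behaviour of rejecting unknown extensions.
        default => throw new InvalidArgumentException(
            "Unsupported extension: '{$extension}'"
        ),
    };
}

echo guessParserLabel('/path/to/Report.DOCX'), PHP_EOL; // zipped-xml
```

Lower-casing the extension first means `Report.DOCX` and `report.docx` are treated alike, and a file with no extension falls through to the exception.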

Individual parsers

use TextAtAnyCost\Parser\DocParser;
use TextAtAnyCost\Parser\PdfParser;
use TextAtAnyCost\Parser\PptParser;
use TextAtAnyCost\Parser\RtfParser;
use TextAtAnyCost\Parser\ZippedXmlParser;

$text = (new DocParser())->extractText('report.doc');
$text = (new PdfParser())->extractText('report.pdf');
$text = (new PptParser())->extractText('slides.ppt');
$text = (new RtfParser())->extractText('memo.rtf');
$text = (new ZippedXmlParser())->extractDocx('report.docx');
$text = (new ZippedXmlParser())->extractOdt('report.odt');

RTF from a string

use TextAtAnyCost\Parser\RtfParser;

$text = (new RtfParser())->parseString($rtfString);

RAR archives

use TextAtAnyCost\Archive\RarReader;
use TextAtAnyCost\Archive\RarWriter;

// List files
$reader = new RarReader();
$files  = $reader->getFileList('archive.rar');
$tree   = $reader->getFileTree('archive.rar');

// Create a stored (no-compression) archive
$writer = new RarWriter();
$writer->create('output.rar');
$writer->addDirectory('docs/reports');
$writer->addFile('/var/www/report.pdf', 'docs/reports');
$writer->close();

Procedural wrappers (backward-compatible)

Each parser file still exports a procedural function for drop-in compatibility:

require 'vendor/autoload.php';

$text = doc2text('report.doc');
$text = pdf2text('report.pdf');
$text = ppt2text('slides.ppt');
$text = rtf2text('memo.rtf');
$text = docx2text('report.docx');
$text = odt2text('report.odt');

Error handling

All parsers throw TextAtAnyCost\Exception\ParseException (extends RuntimeException) on structural or I/O errors. TextExtractor::fromFile() additionally throws \InvalidArgumentException for unsupported extensions.

use TextAtAnyCost\Exception\ParseException;
use TextAtAnyCost\TextExtractor;

try {
    $text = TextExtractor::fromFile($path);
} catch (ParseException $e) {
    // file unreadable or format invalid
} catch (\InvalidArgumentException $e) {
    // extension not supported
}
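
When processing many files, one possible pattern is to collect failures instead of aborting on the first one. The sketch below is self-contained, so it takes the extractor as a callable. In real code that callable could be `TextExtractor::fromFile(...)`. The name `extractAll()` and the stand-in extractor are hypothetical. The catch clause lists `RuntimeException`, which covers `ParseException` because the latter extends it, and `InvalidArgumentException` separately, since that class is not a `RuntimeException`.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: runs $extract on every path and separates
 * successful results from error messages.
 *
 * @param callable(string): string $extract
 * @param list<string> $paths
 * @return array{texts: array<string, string>, errors: array<string, string>}
 */
function extractAll(callable $extract, array $paths): array
{
    $texts = [];
    $errors = [];

    foreach ($paths as $path) {
        try {
            $texts[$path] = $extract($path);
        } catch (RuntimeException | InvalidArgumentException $e) {
            // Record the failure and continue with the next file.
            $errors[$path] = $e->getMessage();
        }
    }

    return ['texts' => $texts, 'errors' => $errors];
}

// Stand-in extractor for demonstration only; it fails for ".xyz" files.
$fakeExtract = function (string $path): string {
    if (str_ends_with($path, '.xyz')) {
        throw new InvalidArgumentException("Unsupported extension: {$path}");
    }
    return "text of {$path}";
};

$result = extractAll($fakeExtract, ['a.pdf', 'b.xyz']);
print_r($result['errors']);
```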

Development

All development tasks run inside Docker — no local PHP installation required.

make install       # install Composer dependencies
make test          # run PHPUnit test suite
make stan          # PHPStan static analysis (level 8)
make cs            # check code style (PHP-CS-Fixer, dry-run)
make cs-fix        # apply code-style fixes
make lint          # PHP syntax check on all files
make test-coverage # HTML coverage report in coverage/
make shell         # interactive shell in the container

Architecture

src/
├── Exception/
│   └── ParseException.php
├── Parser/
│   ├── CfbParser.php          # Abstract base: Windows Compound Binary File
│   ├── DocParser.php          # .doc  (extends CfbParser)
│   ├── PptParser.php          # .ppt  (extends CfbParser)
│   ├── PdfParser.php          # .pdf
│   ├── RtfParser.php          # .rtf
│   └── ZippedXmlParser.php    # .docx / .odt
├── Archive/
│   ├── RarReader.php          # RAR 4.x file listing
│   └── RarWriter.php          # RAR store-mode archive creation
└── TextExtractor.php          # Unified facade
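
Both .doc and .ppt files are Compound File Binary containers, which is why their parsers share a base class. A compound file starts with a fixed 8-byte signature, so a cheap pre-check is possible before any full parse. The sketch below only illustrates that idea. `looksLikeCompoundFile()` and `fileLooksLikeCompoundFile()` are hypothetical helpers, not functions of this library, and the library's own validation may work differently.

```php
<?php
declare(strict_types=1);

// Signature found at the start of every Compound File Binary container.
const CFB_SIGNATURE = "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1";

/**
 * Hypothetical helper: true if the given bytes start with the CFB signature.
 */
function looksLikeCompoundFile(string $bytes): bool
{
    return str_starts_with($bytes, CFB_SIGNATURE);
}

/**
 * Hypothetical helper: reads only the first 8 bytes of a file for the check.
 * Returns false if the file cannot be read.
 */
function fileLooksLikeCompoundFile(string $path): bool
{
    $head = @file_get_contents($path, false, null, 0, 8);

    return $head !== false && looksLikeCompoundFile($head);
}

var_dump(looksLikeCompoundFile(CFB_SIGNATURE . 'rest of header')); // bool(true)
var_dump(looksLikeCompoundFile('%PDF-1.7'));                       // bool(false)
```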

Changelog / Bug fixes

The following bugs from the original 2009 codebase were fixed during modernisation:

| File | Bug |
|------|-----|
| stored-rar.php | getDateTime(): inverted null-check always returned the current time, ignoring the provided timestamp |
| stored-rar.php | getBytes(): strlen(0) returns 1, not 0, so the header size was off by one for zero-length fields |
| pdf.php | Single-quoted '\n', '\r' etc. are literal two-character strings in PHP, so text output contained backslash-n instead of actual newlines |
| pdf.php | FILE_BINARY constant does not exist in PHP; removed (the flag was silently ignored) |
| cfb.php | Dead code after continue, including a debug echo "@" statement that would corrupt output |
| cfb.php | while(...["type"] == 0) array_pop() could loop forever on an empty array (PR #7) |
| doc.php | html_entity_decode("&#x...;") replaced with mb_chr() for correct multi-byte output (PR #9) |
| zipped-xml.php | LIBXML_XINCLUDE removed; it allowed XML <xi:include> to read arbitrary local files (XXE) |
| zipped-xml.php | Lossy iconv("utf-8", "windows-1250") conversion removed; output is now UTF-8 throughout |
| rtf.php | Stack underflow when j < 0 or stack entry missing (PR #4) |
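
Several of these fixes hinge on PHP behaviour that is easy to overlook. The short script below reproduces three of them in isolation: single-quoted escape sequences, the length of an integer zero, and building a character from a code point with mb_chr(). It illustrates the language semantics only and is not taken from the library's code.

```php
<?php
// No strict_types here on purpose: strlen(0) relies on the int being
// coerced to the string "0", which only happens in coercive typing mode.

// 1. Single-quoted strings do not interpret escape sequences.
$literal = '\n';   // a backslash followed by "n": two characters
$newline = "\n";   // a single line-feed character
var_dump(strlen($literal)); // int(2)
var_dump(strlen($newline)); // int(1)

// 2. An integer zero becomes the string "0" before its length is measured.
var_dump(strlen(0)); // int(1)

// 3. mb_chr() turns a Unicode code point directly into an encoded string.
$euro = mb_chr(0x20AC, 'UTF-8');
var_dump($euro === "\u{20AC}"); // bool(true)
var_dump(strlen($euro));        // int(3), the UTF-8 byte length
```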

License

BSD 3-Clause — see LICENSE.
