Pdflexer
.net pdf parsing library
pdflexer
pdflexer is a PDF parsing library focused on efficient parsing and modification of PDF files. It is mainly targeted at users familiar with the PDF spec. It is generally very fast at what it does (e.g. splitting, merging, and text extraction run several times faster than alternatives). The parsing logic was implemented from scratch, but some higher-level functionality (e.g. filters) was ported from the pdf.js project.
pdflexer differs from existing .net libraries in that it:
- Is primarily designed for PDF modification (not just reading). Any object / page read from a PDF can be modified and written to other PDFs.
- Has a mutable model for page contents: move, delete, or modify existing text and graphics on a page (note: in active development).
- Has lazy parsing features which allow objects to be parsed on demand, increasing performance in many cases.
- Uses modern .net features (nullable enabled, Span, ArrayPool, generic math).
- Is designed for direct access to the native PDF object types. Any higher level objects are simple wrappers around the native PDF object types (e.g. `PdfPage` is a wrapper around a `PdfDictionary`. The `PdfDictionary` can be directly modified for features not implemented on `PdfPage`).
- Attempts to be performant / efficient. Not a ton of effort has been put in here, but it is a goal to keep this in mind.
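The lazy parsing and thin-wrapper ideas in the list above can be illustrated with a small, self-contained sketch. Every type here (`LazyValue`, `PageWrapper`) is a hypothetical stand-in written for this illustration and is not part of pdflexer's API; the real `PdfPage` and `PdfDictionary` types may look quite different.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Text;

// Build a raw "dictionary" whose values have not been decoded yet.
var dict = new Dictionary<string, LazyValue>
{
    ["/Rotate"] = new LazyValue(Encoding.ASCII.GetBytes(" 90 ")),
    ["/UserUnit"] = new LazyValue(Encoding.ASCII.GetBytes("2.0")),
};
var page = new PageWrapper(dict);

Console.WriteLine(page.Rotate);                   // prints 90 (decoded on first read)
Console.WriteLine(dict["/UserUnit"].ParseCount);  // prints 0 (never read, never decoded)

// The wrapper has no property for /UserUnit, so edit the dictionary directly.
page.Dict["/UserUnit"] = new LazyValue(Encoding.ASCII.GetBytes("1.0"));

// Hypothetical stand-in, NOT a pdflexer type:
// a value whose raw bytes are decoded only when first accessed.
sealed class LazyValue
{
    private readonly Lazy<string> _parsed;

    // Number of times the decode step actually ran (0 or 1).
    public int ParseCount { get; private set; }

    public LazyValue(byte[] raw)
    {
        _parsed = new Lazy<string>(() =>
        {
            ParseCount++;
            return Encoding.ASCII.GetString(raw).Trim();
        });
    }

    // Decodes on first access, then returns the cached result.
    public string Value => _parsed.Value;
}

// Hypothetical stand-in, NOT a pdflexer type:
// a thin wrapper exposing typed helpers over the raw dictionary.
sealed class PageWrapper
{
    public Dictionary<string, LazyValue> Dict { get; }

    public PageWrapper(Dictionary<string, LazyValue> dict) => Dict = dict;

    // Convenience accessor for one well-known key; null when absent.
    public string? Rotate => Dict.TryGetValue("/Rotate", out var v) ? v.Value : null;
}
```

The point of the design is that the wrapper owns no state of its own: anything it does not surface remains reachable through the underlying dictionary, and values that are never read are never decoded.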
State of library
| Feature                                      | WIP                | Alpha              | Beta               | Release            |
| -------------------------------------------- | ------------------ | ------------------ | ------------------ | ------------------ |
| Document access                              |                    |                    |                    | :heavy_check_mark: |
| General modification <br> (non page content) |                    |                    |                    | :heavy_check_mark: |
| Merging / splitting                          |                    |                    |                    | :heavy_check_mark: |
| Streaming writer                             |                    |                    |                    | :heavy_check_mark: |
| Page content access                          |                    |                    | :heavy_check_mark: |                    |
| Text extraction                              |                    |                    | :heavy_check_mark: |                    |
| Image extraction                             |                    |                    | :heavy_check_mark: |                    |
| Resource dedup                               |                    | :heavy_check_mark: |                    |                    |
| Content creation                             |                    | :heavy_check_mark: |                    |                    |
| Content redaction                            |                    | :heavy_check_mark: |                    |                    |
| Mutable Content                              | :heavy_check_mark: |                    |                    |                    |

- Release - API stable and few breaking changes are expected. Feature has significant test coverage and has been used in real use cases on a wide variety of PDFs.
- Beta - API stable but some breaking changes are expected. Feature has some test coverage and has been used in some real use cases.
- Alpha - API unstable and breaking changes are expected. Feature is generally functional but may lack test coverage and may not have seen any real use.
- WIP - API unstable and many breaking changes are expected. Feature may have significant bugs, may lack test coverage, and may not have seen any real use.
Major Gaps
- [ ] Filter support (ascii85, asciihex, ccitt, deflate, lzw, and run length completed)
- [ ] Public API cleanup / documentation. Lots of classes / properties exposed that will likely be internalized.
- [ ] Documentation / examples
Current Save Behavior
PdfDocument.SaveTo() currently rewrites the document with a page-centric focus to make page copying, page re-ordering, and related production workflows simpler.
Important implications of the current save path:
- Existing catalog `/Names` content is not preserved on save
- Existing `/StructTreeRoot` content is not preserved unless rebuilt through the current structural tree support
- Encrypted PDFs are rewritten without preserving original encryption settings
This means an open/save cycle may remove or alter features such as:
- named destinations
- embedded files / attachments referenced through name trees
- JavaScript and other name-tree-backed catalog features
- existing tagged PDF structure that is not reconstructed in memory
If these features matter for your workflow, validate the saved output carefully before using pdflexer as a general round-trip rewrite tool.
Examples
Some examples are available as polyglot notebooks in the /examples/ folder.
