Pydocsplit

Python "port" of DocumentCloud's great Docsplit utility for splitting PDFs into text and images


# pyDocsplit

A simple Python wrapper around DocumentCloud's great Docsplit utility: http://github.com/documentcloud/docsplit

Please feel free to file issues, fork and extend!

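## Requirements:

The original Docsplit utility (a Ruby gem) and its dependencies. OpenOffice is needed only if you want to convert documents to PDF.

## Installation: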

Follow the instructions to install the original Docsplit here: http://documentcloud.github.com/docsplit/

Put the pydocsplit folder on your Python path, then change the DOCSPLIT_JAVA_ROOT setting in docsplit.py to point to your installation of the Ruby gem.

Remember to run OpenOffice in headless mode if you want to convert documents to PDF. See the Docsplit docs for how to do this: http://documentcloud.github.com/docsplit/
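Before configuring the wrapper, it can help to confirm that the Docsplit command is reachable at all. The following is a minimal sketch, not part of pydocsplit: the helper name `missing_tools` is hypothetical, and it assumes the gem installs an executable named `docsplit` on your `PATH`.

```python
import shutil


def missing_tools(names=("docsplit",)):
    """Return the names from `names` that cannot be found on PATH.

    Hypothetical helper, not part of pydocsplit. The default name
    "docsplit" assumes the Ruby gem put its executable on PATH.
    """
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All required tools were found.")
```

If you also convert office documents, you could pass the name of your OpenOffice executable as an extra entry; that name varies between installations, so it is left out of the default.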

## Usage:

    from pydocsplit import Docsplit

    d = Docsplit()
    # Convert an office document to PDF (needs OpenOffice in headless mode)
    d.extract_pdf('/path/to/my/document.doc', output='/path/to/outputdir/')
    # Extract a range of pages from a PDF
    d.extract_pages('/path/to/my/pdffile.pdf', output='/path/to/outputdir/', pages='1-2')
    # Extract the text of a PDF
    d.extract_text('/path/to/my/pdffile.pdf', output='/path/to/outputdir/', returntext=True)
    # Render selected pages as images in several sizes and formats
    d.extract_images('/path/to/my/pdffile.pdf', output='/path/to/outputdir/', sizes=['500x', '250x'], formats=['png', 'jpg'], pages=[1,2,5,7])
    # Read a single metadata field
    documenttitle = d.extract_meta('/path/to/my/pdffile.pdf', 'title')
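To illustrate how keyword arguments like those passed to `extract_images` could map onto a Docsplit command line, here is a rough sketch. It is an assumption, not pydocsplit's actual implementation: `build_images_command` is a hypothetical helper, and it targets the `docsplit images` command with the `--size`, `--format`, `--pages` and `--output` options as described in the Docsplit documentation. It only builds the argument list and does not run anything.

```python
def build_images_command(pdf_path, output, sizes=None, formats=None, pages=None):
    """Build an argument list for `docsplit images`.

    Hypothetical helper for illustration only; pydocsplit itself may
    call Docsplit differently.
    """
    command = ["docsplit", "images"]
    if sizes:
        # Several sizes are passed as one comma-separated value.
        command += ["--size", ",".join(sizes)]
    if formats:
        command += ["--format", ",".join(formats)]
    if pages:
        command += ["--pages", ",".join(str(page) for page in pages)]
    command += ["--output", output, pdf_path]
    return command


if __name__ == "__main__":
    print(build_images_command(
        "/path/to/my/pdffile.pdf",
        "/path/to/outputdir/",
        sizes=["500x", "250x"],
        formats=["png", "jpg"],
        pages=[1, 2, 5, 7],
    ))
```

Returning a list rather than a single string means the result can be handed to `subprocess.run` without shell quoting problems.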

## TODO:

* Support multiple PDFs as input
* Enhance parsing of pages options/ranges
* Fix page numbers on generated images of PDF pages
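As a starting point for the page-range item above, one possible approach is sketched below. It is not existing pydocsplit code: `parse_pages` is a hypothetical helper, and it assumes a specification made of comma-separated page numbers and inclusive `start-end` ranges, such as `'1-3,5'`.

```python
def parse_pages(spec):
    """Expand a page specification such as '1-3,5' into a sorted list.

    Hypothetical helper; the accepted syntax is an assumption.
    Raises ValueError for non-numeric parts, pages below 1, or
    ranges whose end is smaller than their start.
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # Tolerate stray commas such as '1,,2'.
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start < 1 or end < start:
                raise ValueError("invalid page range: %r" % part)
            pages.update(range(start, end + 1))
        else:
            page = int(part)
            if page < 1:
                raise ValueError("invalid page number: %r" % part)
            pages.add(page)
    return sorted(pages)


if __name__ == "__main__":
    print(parse_pages("1-3, 5, 7"))
```

Collecting the pages in a set removes duplicates, so overlapping ranges such as `'1-3,2-4'` yield each page once.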