PdfToText
Extracts text from PDF files
INTRODUCTION
The PdfToText class has been designed to extract textual contents from a PDF file.
It's pretty simple to use :
include ( 'PdfToText.phpclass' ) ;
$pdf = new PdfToText ( 'sample.pdf' ) ;
echo $pdf -> Text ; // or you could also write : echo ( string ) $pdf ;
The same PdfToText object can be reused to process additional files :
$pdf -> Load ( 'sample2.pdf' ) ;
echo $pdf -> Text ;
Additionally, the PdfToText class provides support methods for getting the page number of any text in the underlying PDF file.
See the class's blog for an overview of the mechanics involved in extracting text contents from PDF files.
Examples are also provided in the examples/ directory. Please have a look at the examples/README.md file for a brief explanation of their structure.
IMPORTANT : the PdfToText class generates UTF-8 encoded text. If your default character set is not UTF-8, you may need to add the following meta tag in the <head> part of your HTML page :
<meta charset="utf-8" />
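As an illustration, the extracted text can be wrapped in a minimal page that declares its character set. This is only a sketch : render_utf8_page() is a hypothetical helper, not part of the PdfToText class, and a literal string stands in for the value of $pdf -> Text.

```php
<?php
// Hypothetical helper (not part of PdfToText): wraps UTF-8 text in a
// minimal HTML page that declares its character set.
function render_utf8_page(string $text): string
{
    // Escape markup characters so the extracted text is displayed literally.
    $escaped = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

    return "<!DOCTYPE html>\n"
         . "<html>\n<head>\n<meta charset=\"utf-8\" />\n</head>\n"
         . "<body>\n<pre>" . $escaped . "</pre>\n</body>\n</html>\n";
}

// In a real script the text would come from $pdf -> Text.
echo render_utf8_page("Caf\u{00E9} <menu>");
```

Escaping with htmlspecialchars() matters here because text extracted from a PDF file may contain characters such as < or & that a browser would otherwise interpret as markup.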
FEATURES
Text rendering in a PDF file is made using an obscure language which provides multiple ways to position the same text at the same location on a page. You could say for example :
. Goto coordinates (x,y)
. Render text ( "Mail : someone@somewhere.com" )
Or :
. Goto next line
. Goto (x1,y)
. Render text ( "Mail" )
. Goto (x2, y)
. Render text ( ":" )
. Goto (x3, y)
. Render text ( "someone@somewhere.com" )
(Note that this is a pseudo-language.) Both pieces of code would probably display the same text at the same position, in rather different ways.
This is why the PdfToText class tracks the following information from the drawing-instruction stream to provide more accurate text rendering (even if the output is only pure text) :
- The currently selected font is tracked. This is important because :
  - Each font in a PDF file can have its own character map. In that case, the characters to be drawn do not specify actual character codes, but indexes into the font's character map.
  - The current font size is memorized ; this helps to estimate the current y-coordinate when relative positioning instructions are used (such as "goto next line"). Although approximate, this works in the great majority of cases.
- If multiple strings are rendered at the same y-coordinate, they are grouped onto the same line. Note that they must appear sequentially in the instruction flow for this trick to work.
- Sub/super-scripted text is usually written at a slightly different y-coordinate than the line it appears in. Such a situation is detected, and the sub/super-scripted text correctly appears on the same line.
The limitations of these heuristics do not apply if the PDFOPT_BASIC_LAYOUT option is specified.
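The line-grouping idea can be sketched as follows. This is not the code used by the PdfToText class : the fragment format (an array with 'y' and 'text' keys), the function name and the fixed tolerance are all assumptions made for the illustration.

```php
<?php
// Illustrative sketch only: groups consecutive text fragments whose
// y-coordinates are close enough into a single output line.
// The fragment format and the tolerance value are assumptions.
function group_fragments_by_line(array $fragments, float $tolerance = 2.0): array
{
    $lines   = [];
    $current = null;   // text accumulated for the line being built
    $lineY   = null;   // y-coordinate of the first fragment of that line

    foreach ($fragments as $fragment) {
        if ($current !== null && abs($fragment['y'] - $lineY) <= $tolerance) {
            // Same line, or a slightly shifted sub/superscript: append.
            $current .= $fragment['text'];
        } else {
            // The y-coordinate moved too far: flush and start a new line.
            if ($current !== null) {
                $lines[] = $current;
            }
            $current = $fragment['text'];
            $lineY   = $fragment['y'];
        }
    }

    if ($current !== null) {
        $lines[] = $current;
    }

    return $lines;
}

$fragments = [
    ['y' => 700.0, 'text' => 'Mail'],
    ['y' => 700.0, 'text' => ' : '],
    ['y' => 700.0, 'text' => 'someone@somewhere.com'],
    ['y' => 680.0, 'text' => 'E = mc'],
    ['y' => 681.5, 'text' => '2'],   // superscript, slightly raised
];

print_r(group_fragments_by_line($fragments));
```

Only consecutive fragments are compared, which mirrors the remark above that strings must appear sequentially in the instruction flow to end up on the same line.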
ADVANCED FEATURES
The class is able to :
- Render basic page layout (i.e., the text is drawn in the same order that Acrobat Reader renders it) using the PDFOPT_BASIC_LAYOUT option.
- Retrieve form data as a standalone object, using the GetFormData() method.
- Capture areas of text within a page
KNOWN ISSUES
Here is a list of known issues for the PdfToText class ; I'm working on solving them, so I hope this whole paragraph will soon completely disappear !
- Unwanted line breaks may occur within text lines. This is due to the fact that the PDF file contains drawing instructions that use relative positioning. This is especially true for files created with generators such as PdfCreator. However, some provisions have been made to try to put text with roughly the same y-coordinate onto the same line. This limitation does not apply if the PDFOPT_BASIC_LAYOUT option is specified.
- Encrypted PDF files are not supported
A NOTE FOR WINDOWS USERS
An Apache server on Linux platforms allocates a default stack size of 8 MB for its threads. This value is set to 1 MB on Windows platforms.
However, some regular expressions used by the PdfToText class may cause the PHP PCRE extension to require a little bit more than 1 MB of stack space when processing certain PDF files.
Such a situation will cause your Windows Apache server to crash and your browser to display a message such as : Connection reset. This behavior affects several products such as EasyPHP, XAMPP or Wamp.
To solve this issue, you will have to enable the mpm module in your httpd.conf file and define a new stack size, as in the following example, given for a Wamp server :
Include conf/extra/httpd-mpm.conf
ThreadStackSize 8388608
TESTING
I have tested this class against dozens of documents from various origins, and tested the output generated from each sample document by the PdfCreator, PrimoPdf and PDF Pro tools.
I also compared the output of the PdfToText class with that of Acrobat Reader, when you choose the Save as...Text option. In many situations, the class performs better in positioning the final text than Acrobat Reader does.
However, all of that will not guarantee that it will work in every situation ; so, if you find something weird or not functioning properly using the PdfToText class, feel free to contact me on this class' blog, and/or send me a sample PDF file at the following email address :
christian.vigh@wuthering-bytes.com
OTHER LINKS
This class can also be found here :
http://www.phpclasses.org/package/9732-PHP-Extract-text-contents-from-PDF-files.html
and here, where you will also find a FAQ section and be able to upload your PDF file samples for live testing :
and also here :
https://github.com/christian-vigh-phpclasses/PdfToText
REFERENCE
METHODS
Constructor
$pdf = new PdfToText ( $filename = null, $options = self::PDFOPT_NONE, $user_password = false, $owner_password = false ) ;
Instantiates a PdfToText object. If a filename has been specified, its text contents will be loaded and made available in the Text property (otherwise, you will have to call the Load() method for that).
See the Options property for a description of the $options parameter.
The $user_password and $owner_password parameters specify the user/owner password to be used for decrypting a password-protected file (note that this class is not a password cracker !).
In the current version, decryption of password-protected files is not yet supported.
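A minimal sketch of passing an option to the constructor is given below. The wrapper function extract_with_basic_layout() and its file checks are hypothetical additions for the example ; it also assumes that the class file is named PdfToText.phpclass, as in the introduction, and that the option is exposed as the class constant PdfToText::PDFOPT_BASIC_LAYOUT, as the self::PDFOPT_NONE default in the signature suggests.

```php
<?php
// Hypothetical wrapper: returns the text of $pdfFile extracted with the
// PDFOPT_BASIC_LAYOUT option, or null when the class file or the PDF
// file cannot be found.
function extract_with_basic_layout(string $pdfFile, string $classFile = 'PdfToText.phpclass'): ?string
{
    if (!is_file($classFile) || !is_file($pdfFile)) {
        return null;
    }

    include_once $classFile;

    // The second constructor argument carries the options.
    $pdf = new PdfToText($pdfFile, PdfToText::PDFOPT_BASIC_LAYOUT);

    return (string) $pdf->Text;
}

$text = extract_with_basic_layout('sample.pdf');
echo ($text === null) ? "sample.pdf or PdfToText.phpclass not found\n" : $text;
```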
Load ( $filename, $user_password = false, $owner_password = false )
Loads the text contents of the specified filename.
The $user_password and $owner_password parameters specify the user/owner password to be used for decrypting a password-protected file (note that this class is not a password cracker !).
In the current version, decryption of password-protected files is not yet supported.
The method returns the decoded text contents, which are also available through the Text property.
LoadFromString ( $contents, $user_password = false, $owner_password = false )
Loads the text contents from a string that holds the contents of a PDF file.
The $user_password and $owner_password parameters specify the user/owner password to be used for decrypting a password-protected file (note that this class is not a password cracker !).
In the current version, decryption of password-protected files is not yet supported.
The method returns the decoded text contents, which are also available through the Text property.
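LoadFromString() is convenient when the PDF data does not come from a file path, for example when it was fetched from a database. The sketch below reads the bytes with file_get_contents() purely for demonstration ; the extract_from_bytes() wrapper and its class-file check are hypothetical additions for the example.

```php
<?php
// Hypothetical wrapper: extracts text from PDF data already held in memory.
// Returns null when the class file cannot be found.
function extract_from_bytes(string $contents, string $classFile = 'PdfToText.phpclass'): ?string
{
    if (!is_file($classFile)) {
        return null;
    }

    include_once $classFile;

    // No filename is given to the constructor: nothing is loaded yet.
    $pdf = new PdfToText();

    // LoadFromString() returns the decoded text, also available in $pdf->Text.
    return (string) $pdf->LoadFromString($contents);
}

// Demonstration only: the bytes could equally come from a database or an upload.
if (is_file('sample.pdf')) {
    $text = extract_from_bytes(file_get_contents('sample.pdf'));
    echo ($text === null) ? "PdfToText.phpclass not found\n" : $text;
}
```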
AddAdobeExtraMappings ( $mappings )
Adobe supports 4 predefined character sets (Standard, Mac, WinAnsi and PDF). All the characters in these sets are identified by a character name, a little bit like HTML entities ; for example, 'one' will be the character '1', 'acircumflex' will be 'â', etc.
There are thousands of character names defined by Adobe (see https://mupdf.com/docs/browse/source/pdf/pdf-glyphlist.h.html).
Some character names are not in this list ; this is the case, for example, of the 'ax' character names, where 'x' is a decimal number. When such a character is specified in a /Differences array, a CharProc[] array elsewhere in the file gives an object id for each of those characters.
The referenced object(s) in turn contain drawing instructions to draw the glyph. At no point can you guess the corresponding Unicode character for this glyph, since that information is not contained in the PDF file.
The AddAdobeExtraMappings() method allows you to specify such correspondences. Specify an array as the $mappings parameter, whose keys are the Adobe character name (for example, "a127") and values the corresponding Unicode values.
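A hedged usage sketch follows. The glyph names and code points are invented for the example and will differ from one PDF file to another ; the load_with_mappings() wrapper is hypothetical, and calling AddAdobeExtraMappings() before Load() is an assumption, on the grounds that the mappings need to be known when the text is decoded.

```php
<?php
// Hypothetical mappings: these names and code points are examples only.
$mappings = [
    'a127' => 0x2022,   // U+2022 BULLET
    'a128' => 0x2713,   // U+2713 CHECK MARK
];

// Hypothetical wrapper: registers the mappings, then loads the file.
// Returns null when the class file or the PDF file cannot be found.
function load_with_mappings(string $pdfFile, array $mappings, string $classFile = 'PdfToText.phpclass'): ?string
{
    if (!is_file($classFile) || !is_file($pdfFile)) {
        return null;
    }

    include_once $classFile;

    $pdf = new PdfToText();                    // no filename: nothing loaded yet
    $pdf->AddAdobeExtraMappings($mappings);    // assumption: must precede Load()

    return (string) $pdf->Load($pdfFile);
}

$text = load_with_mappings('sample.pdf', $mappings);
echo ($text === null) ? "sample.pdf or PdfToText.phpclass not found\n" : $text;
```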
The $mappings parameter is an associative array whose keys are Adobe character names. The array values can take several forms :
- A character
- An integer value
- An array of up to four character or integer values.
Internally, every specified value is converted to an array of four integer values, one for each of the standard Adobe character sets (Standard, Mac, WinAnsi and PDF). The following rules apply :
- If the input value is a single character, the output array corresponding to the Adobe character name will be a set of 4 elements, each holding the ordinal value of the supplied character.
- If the input value is an integer, the output array will be a set of 4 identic
