Recommend this page to a friend! |
Download |
Info | Example | Files | Install with Composer | Download | Reputation | Support forum | Blog | Links |
Ratings | Unique User Downloads | Download Rankings | ||||
73% | Total: 9,458 This week: 4 | All time: 157 This week: 18 |
Version | License | PHP version | Categories | |||
pdf-to-text 1.6.7 | GNU General Publi... | 5.1.0 | PHP 5, Files and Folders, Text proces..., P... |
Description | Author | |||
This package can extract the text contents from a PDF file using pure PHP code (no external tools are needed). Innovation Award
|
Recommendation for a PHP class to extract table from pdf
I want to extract table from pdf to insert to database website
Extracting PDF to Text
I need to parse PDF file text
What is the best PHP pdf to text converter class?
Convert a PDF document into text
What is the best PHP pdf parser class?
Read PDF documents
convert pdf to text
convert pdf to text that i will read from the email
What is the best PHP search string in pdf class?
Search string in PDF and return page number
please provide pdf2txt
please provide pdf2txt please provide pdf2txtplease provide pdf2
extract data from pdf file which consists of more tables
I want to extract data from pdf file and store it into mysql
What is the best PHP pdf to text class?
pdf to text format in php
What is the best PHP pdf reader class?
I need a pdf reader class
What is the best PHP pdf to html class?
Convert PDF to HTML
What is the best PHP pdf metadata class?
PDF metadata
What is the best PHP pdf to excel class?
PDF to Excel library converter
<?php
|
The PdfToText class has been designed to extract textual contents from a PDF file.
It's pretty simple to use :
include ( 'PdfToText.phpclass' ) ;
$pdf = new PdfToText ( 'sample.pdf' ) ;
echo $pdf -> Text ; // or you could also write : echo ( string ) $pdf ;
The same PdfToText object can be reused to process additional files :
$pdf -> Load ( 'sample2.pdf' ) ;
echo $pdf -> Text ;
Additionally, the PdfToText class provides support methods for getting the page number of any text in the underlying PDF file.
Look at the class' blog for an overview on the underlying mechanics that are involved into extracting text contents from pdf files.
Examples are also provided in the examples/ directory. Please have a look at the examples/README.md file for a brief explanation on their structure.
IMPORTANT : the PdfToText class generates UTF8-encoded text. If your default character set is not UTF-8, you may need to add the following meta tag in the <head> part of your HTML page :
<meta charset="utf-8" />
Text rendering in a PDF file is made using an obscure language which provides multiple ways to position the same text at the same location on a page. You could say for example :
. Goto coordinates (x,y)
. Render text ( "Mail : someone@somewhere.com" )
Or :
. Goto next line
. Goto (x1,y)
. Render text ( "Mail" )
. Goto (x2, y)
. Render text ( ":" )
. Goto (x3, y)
. Render text ( "someone@somewhere.com" )
(note that I'm using a pseudo-language here). Both pieces of code would probably display the same text at the same position, by using rather different ways.
This is why the PdfToText class tracks the following information from the drawing-instruction stream to provide more accurate text rendering (even if the output is only pure text) :
The currently selected font is tracked. This is important because :
- Each font in a PDF file can have its own character map. This means in this case that characters to be drawn using the Adobe language do not specify actual character codes, but an index into the font's character map. - The current font size is memorized ; this helps to evaluate what is the current y-coordinate when relative positioning instructions are used (such as "goto next line"). Although approximative, this works in a great majority of cases
These symptoms will not appear if the PDFOPT\_BASIC\_LAYOUT option is specified.
The class is able to :
Here is a list about known issues for the PdfToText class ; I'm working on solving them, so I hope this whole paragraph will soon completely disappear !
An Apache server on Linux platforms allocates a default stack size of 8Mb for its threads. This value is set to 1Mb on Windows platforms.
However, some regular expressions used by the PdfToText class may cause the PHP PCRE extension to require a little bit more than 1Mb of stack space when processing certain PDF files.
Such a situation will cause your Windows Apache server to crash and your browser to display a message such as : Connection reset. This behavior affect several products such as EasyPHP, XAMPP or Wamp.
To solve this issue, you will have to enable the mpm module in your httpd.conf file and define a new stack size, as in the following example, given for a Wamp server :
Include conf/extra/httpd-mpm.conf
ThreadStackSize 8388608
I have tested this class against dozens of documents from various origins, and tested the output generated from each sample document by the PdfCreator, PrimoPdf and PDF Pro tools.
I also compared the output of the PdfToText class with that of Acrobat Reader, when you choose the Save as...Text option. In many situations, the class performs better in positioning the final text than Acrobat Reader does.
However, all of that will not guarantee that it will work in every situation ; so, if you find something weird or not functioning properly using the PdfToText class, feel free to contact me on this class' blog, and/or send me a sample PDF file at the following email address :
christian.vigh@wuthering-bytes.com
This class can also be found here :
http://www.phpclasses.org/package/9732-PHP-Extract-text-contents-from-PDF-files.html
and here, where you will also find a FAQ section and be able to upload your PDF file samples for live testing :
and also here :
https://github.com/christian-vigh-phpclasses/PdfToText
$pdf = new PdfToText ( $filename = null, $options = self::PDFOPT_NONE, $user\_password = false, $owner\_password = false ) ;
Instantiates a PdfToText object. If a filename has been specified, its text contents will be loaded and made available in the Text property (otherwise, you will have to call the Load() method for that).
See the Options property for a description of the $options parameter.
The $user\_password and $owner\_password parameters specify the user/owner password to be used for decrypting a password-protected file (note that this class is not a password cracker !).
In the current version, decryption of password-protected files is not yet supported.
Loads the text contents of the specified filename.
The $user\_password and $owner\_password parameters specify the user/owner password to be used for decrypting a password-protected file (note that this class is not a password cracker !).
In the current version, decryption of password-protected files is not yet supported.
The method returns the decoded text contents, which are also available through the Text property.
Loads the text contents of the specified PDF contents.
The $user\_password and $owner\_password parameters specify the user/owner password to be used for decrypting a password-protected file (note that this class is not a password cracker !).
In the current version, decryption of password-protected files is not yet supported.
The method returns the decoded text contents, which are also available through the Text property.
Adobe supports 4 predefined fonts : standard, Mac, WinAnsi and PDF). All the characters in these fonts are identified by a character time, a little bit like HTML entities ; for example, 'one' will be the character '1', 'acircumflex' will be 'â', etc.
There are thousands of character names defined by Adobe (see https://mupdf.com/docs/browse/source/pdf/pdf-glyphlist.h.html).
Some of them are not in this list ; this is the case for example of the 'ax' character names, where 'x' is a decimal number. When such a character is specified in a /Differences array, then there is somewhere a CharProc[] array giving an object id for each of those characters.
The referenced object(s) in turn contain drawing instructions to draw the glyph. At no point you could guess what is the corresponding Unicode character for this glyph, since the information is not contained in the PDF file.
The AddAdobeExtraMappings() method allows you to specify such correspondences. Specify an array as the $mappings parameter, whose keys are the Adobe character name (for example, "a127") and values the corresponding Unicode values.
The $mappings parameter is an associative array whose keys are Adobe character names. The array values can take several forms :
Note
In this current implementation, the method applies the mappings to ALL Adobe default fonts. That is, you cannot have one mapping for one Adobe font referenced in the PDF file, then a second mapping for a second Adobe font, etc.
$captures = $pdf -> GetCaptures ( $full = false ) ;
Retrieves the areas of text captured by the PdfToText class. This assumes that you specified first the PDFOPT\_CAPTURE flag to the class constructor, then called either the SetCaptures() or SetCapturesFromString() method.
When the $full parameter is set to false (the default), the returned object is a hierarchy of stdClass objects which maps capture names to their values.
When set to true, the returned object is of type PdfToTextCaptures, which holds much more information. It should be useful only when doing internal debugging of the PdfToText class.
Note :
Accessing a property value when the $full parameter is false can be performed like this (here, we are accessing the value of the Title capture in the first page) :
$captures -> Title [1]
When this parameter is true, you have to specify the Text property to retrieve its contents :
$captures -> Title [1] -> Text
If you plan to switch between both return types during your development phase, you can use a unified approach that works in both cases :
( string ) $captures -> Title [1]
See the Capturing Text* section for more information on capturing text from PDF files.
$count = $pdf -> GetFormCount ( ) ;
Returns the actual number of top-level forms defined in the PDF file.
$object = $pdf -> GetFormData ( $template_xml, $index = 0 ) ;
Retrieves the form data for the specified top-level form index. Data is returned as an object inheriting from class PdfToTextFormData, which provides ony helper functions to its derived classes.
Data retrieval can be based on a template XML file, or, when the $template\_xml parameter is null, a default template will be created using the field names defined in the PDF file.
See the Form templates later in this file to get more information on how templates are used and how form data objects are built.
Given a byte offset in the Text property, returns its page number in the pdf document.
Page numbers start from 1.
$status = $pdf -> HasFormData ( ) ;
Returns true if the PDF file contains form data.
Sometimes it may be convenient, when you want to extract only a portion of text, to say : "I want to extract text between this title and this title". The MarkTextLike() method provides some support for such a task. Imagine you have documents that have the same structure, all starting with an "Introduction" title :
Introduction
...
some text
...
Some other title
...
By calling the MarkTextLike() method such as in the example below :
$pdf -> MarkTextLike ( '/\bIntroduction\b/', '<M>', '</M' ) ;
then you will get as output :
<M>Introduction</M>
...
some text
...
<M>Some other title</M>
Adding such markers in the output will allow you to easily extract the text between the chapters "Introduction" and "Some other title", using a regular expression.
The font name used for the first string matched by the specified regular expression will be searched later to add markers around all the text portions using this font.
The parameters are the following :
$pdf -> SetCaptures ( $xml_file ) ;
Loads the capture definitions from the specified XML file.
Note that you must instantiate the PdfToText class with the PDFOPT\_CAPTURE flag for this function to work.
See the Capturing Text* section for more information on capturing text from PDF files.
$pdf -> SetCaptures ( $xml_data ) ;
Same as the SetCaptures() method, but loads the capture definitions from a string instead of a file.
Note that you must instantiate the PdfToText class with the PDFOPT\_CAPTURE flag for this function to work.
See the Capturing Text* section for more information on capturing text from PDF files.
$result = $pdf -> text_strpos ( $search, $start = 0 ) ;
$result = $pdf -> text_stripos ( $search, $start = 0 ) ;
These methods behave as the strpos/stripos PHP functions, except that :
Parameters are the following :
The method returns an array of two values containing the page number and text offset if the searched string has been found, or false otherwise.
$result = $pdf -> document_strpos ( $search, $group_by_page = false ) ;
$result = $pdf -> document_stripos ( $search, $group_by_page = false ) ;
Searches for ALL occurrences of a given string in the pdf document. The value of the $group_by_page parameter determines how the results are returned :
For example, if a pdf document contains the string "here" at character offset 100 and 200 in page 1, and position 157 in page 3, the returned value will be :
When $group\_by\_page is false :
[ [ 1, 100 ], [ 1, 200 ], [ 3, 157 ] ]
When $group\_by\_page is true :
[ 1 => [ 100, 200 ], 3 => [ 157 ]
The parameters are the following :
The method returns an array of page numbers/character offsets or false if the specified string does not appear in the document.
$status = $pdf -> text_match ( $pattern, &$match = null, $flags = 0, $offset = 0 ) ;
$status = $pdf -> document_match ( $pattern, &$match = null, $flags = 0, $offset = 0 ) ;
text\_match() calls the preg\_match() PHP function on the pdf text contents, to locate the first occurrence of text that matches the specified regular expression.
document\_match() calls the preg\_match\_all() function to locate all occurrences that match the specified regular expression.
Note that both methods add the PREG\_OFFSET\_CAPTURE flag when calling preg\_match/preg\_match\_all so you should be aware that all captured results are an array containing the following entries :
Parameters are the following :
As for their PHP counterparts, these methods return the number of matched occurrences, or false if the specified regular expression is invalid.
This section describes the properties that are available in a PdfTText object. Note that they should be considered as read-only.
Author name, as inscribed in the PDF file.
When the PDFOPT\_AUTOSAVE\_IMAGES flag has been specified, this array of strings will contain the filenames where the auto-saved images have been put.
A string to be used for separating chunks of text. The main goal is for processing data displayed in tabular form, to ensure that column contents will not be catenated. However, this does not work in all cases.
The default value is the empty string.
When the PDFOPT\_BASIC\_LAYOUT option is specified, this property is used to separate parts of text that are visually on a different x-axis on the same line. In this case, the default separator will be a white space.
Path to the directory containing CID mapping tables.
A string containing the document creation date, in UTC format. The value can be used as a parameter to the strtotime() PHP function.
Application used to create the original document.
A PDF document normally starts with a string of the form :
%PDF-a.b[.c.d...]
where "a" and "b" (and the "c", "d", ... following) represent the version number of the PDF file.
Some PDF documents may come with garbage at the beginning ; this is "illegal" of course, but Acrobat Reader is able to cope with that. So can do the PdfToText class...
This property holds the byte offset in the input file where the starting "%PDF" string has been found.
Algorithm used for password-protected files.
The Adobe documentation states :
A code specifying the algorithm to be used in encrypting and decrypting the document :
The revision number of the Standard security handler that is required to interpret this dictionary. The revision number is :
A set of PDFPERM\_* constants describing which operations are authorized on a password-protected PDF file.
Defined only when EncryptionAlgorithm is 2 or 3. Length of key, in bits, used for encryption and decryption. The size is a multiple of 8, with a minimum value of 40 and maximum value of 128.
One of the PDFCRYPT\_\* constants.
This value is set to *PDFCRYPT\_NONE if the PDF file is not password-protected.
A flag coming from a password-protected file that says is the document metadata is also encrypted.
The string to be used for line breaks. The default is PHP\_EOL.
This property is expressed in percents ; it gives the extra percentage to add to the values computed by the PdfTexterFont::GetStringWidth() method.
This value is basically used when computing text positions and string lengths with the *PDFOPT\_BASIC\_LAYOUT option : the computed string length is shorter than its actual length (because of extra spacing determined by character kerning in the font data, among other details).
To determine whether two consecutive blocks of text on the same should be separated by a space, the class will empirically add this extra percentage to the computed string length. The default value is -5 (percent).
Name of the file whose text contents have been extracted. This value will be an empty string if the LoadFromString() method has been called instead of Load().
A pair of unique ids generated for the document. The value of ID is used for decrypting password-protected documents.
The second id is not clearly described in the Pdf specifications.
A template string to be used for generating filenames when the PDFOPT\_AUTOSAVE\_IMAGES flag has been specified.
It can contain the following print-like formats :
The default value for this property is :
$ImageAutoSaveFileTemplate = "%p/%f.%d.%s" ;
Using the template above, if your input filename is "/tmp/test.pdf" and contains 3 images, then you will get the following output images :
You can also specify a width with the "%d" format specifier ; the numbering will then be padded with leading zeroes. For example, the following template using the same example PDF file as above :
$pdf -> ImageAutoSaveFileTemplate = "%p/%f.%3d.%s" ;
will generate the following filenames :
When the PDFOPT\_AUTOSAVE\_IMAGES flag has been specified, indicates the format to be used for saving the images found in the PDF file. It can be any of the constants defined by the gd library regarding image formats :
IMG_JPEG => 'jpg',
IMG_JPG => 'jpg',
IMG_GIF => 'gif',
IMG_PNG => 'png',
IMG_WBMP => 'wbmp',
IMG_XPM => 'xpm'
Note that the association between the constant and corresponding file suffix is automatically handled.
An array of objects inheriting from the PdfImage class. Currently, only the PdfJpegIMage class is implemented.
The class currently supports the following properties :
The following methods are available :
Currently, images stored in proprietary Adobe format are not processed and will not appear in this array.
Note that images will be extracted only if the PDFOPT\_DECODE\_IMAGE\_DATA is enabled.
Number of images found in the supplied PDF file. This number will only take into account the images whose format is recognized by the PdfToText class.
An array of associative arrays that contain the following entries :
Note that image data will be extracted only if the PDFOPT\_GET\_IMAGE\_DATA is enabled.
This property is set to true if the Pdf file is encrypted through some kind of password protection scheme.
Keywords, as recorded in the author information part.
Specifies a maximum execution time in seconds for processing a single file. If this limit is reached, a PdfToTextTimeoutException exception will be thrown before PHP terminates the script. This allows the script to gracefully handle the error instead of PHP itself.
Positive values are indicated in seconds. Negative values are subtracted from the max\_execution\_time PHP setting to compute the maximum execution time allowed.
If the computed timeout value is out of range, the retained execution time will be max\_execution\_time minus one second.
The value of this property is taken into account only if the PDFOPT\_ENFORCE\_EXECUTION\_TIME option has been specified.
Maximum number of images to be extracted. The default is the value 0, meaning that all images will be selected for output if the PDFOPT\_GET\_IMAGE\_DATA or PDFOPT\_AUTOSAVE\_IMAGES option is set.
This static property is the same as MaxExecutionTime, except that it works globally. If you have to process x files, then it will ensure that the global execution time does not exceed the value of this property.
The value of this property is taken into account only if the PDFOPT\_ENFORCE\_GLOBAL\_EXECUTION\_TIME option has been specified.
Maximum number of pages to be selected. The default is the value 0, meaning that all pages will be selected for output.
A value of 1 will extract the contents of the first page only, which can be useful if your PDF file is large and you're only interested by the contents of the first page.
When this number is negative, selection starts from the end of the file : -1 means "extract the last page", -2 means "extract the last two pages", and so on.
Report the memory (peak) used by the Load() method.
Sometimes, characters (or blocks of characters) are separated by an offset which is counted in 1/1000's of text units. For certain ranges of values, when displayed on a graphical device, these consecutive characters appear to be separated by one space (or more). Of course, when generating ascii output, we would like to have some equivalent of such spacing.
This is what the MinSpaceWidth property is meant for : insert an ascii space in the generated output whenever the offset found exceeds MinSpaceWidth text units.
Note that if the PDFOPT\_REPEAT\_SEPARATOR flag is set for the Options property, the number of spaces inserted will always be based on a multiple of 1000, even if MinSpaceWidth is less than 1000. This means that if MinSpaceWidth is 200, and the Options property has the PDFOPT\_REPEAT\_SEPARATOR flag set, AND the offset between two chunks of characters is 1000 text units, only one space will be inserted, not 5 (which would be the result of 1000/200).
A string containing the last document modification date, in UTC format. The value can be used as a parameter to the strtotime() PHP function.
A combination of the following flags :
PDFOPT\_BASIC\_LAYOUT : Tries to render the text in the order you can see it with Acrobat Reader. Note that the elements will not be mapped in the output exactly as they appear with Acrobat Reader : elements physically disjoint on the x-axis will be separated by a space by default. The BlockSeparator property can be used to modify this separator. The following text for example :
Company1 Company2 address1 address2 city1 city2
will be rendered as :
Company1 Company2
address1 address2
city1 city2
or, if you set the BlockSeparator property to "#", the output will be :
Company1#Company2
address1#address2
city1#city2
PDFOPT\_NO\_HYPHENATED\_WORDS : When specified, tries to join back hyphenated words into a single word. For example, the following text :
this is a sam- ple text using hyphe- nated words that can split over seve- ral lines.
will be rendered as :
this is a sample
text using hyphenated
words that can split
over several lines.
Associative array containing individual page contents. The array key is the page number, starting from 1.
String to be used when building the Text property to separate individual pages. The default value is a newline.
Application used to generate the PDF file contents.
A string to be used for separating blocks when a negative offset less than -1000 thousands of characters is specified between two sequences of characters specified as an array notation. This trick is often used when a pdf file contains tabular data.
The default value is a space.
Subject written in the author information part.
An associative array that contains the following entries :
A string containing the whole text extracted from the underlying pdf file. Note that pages are separated with a form feed.
Document title, as specified in the author information object.
When a Unicode character cannot be correctly recognized, the Utf8Placeholder property will be used as a substitution.
The string can contain format specifiers recognized by the sprintf() function. The parameter passed to sprintf() is the Unicode codepoint that could not be recognized (an integer value).
The default value is the empty string, or the string '[Unknown character 0x%08X]' when debug mode is enabled.
Note that if you change the PdfToText::$DEBUGvariableafterthe first instantiation of the class, then you will need to manually set the value of thePdfToText::Utf8PlaceHolder* static property.
The PDFOPT\_\constants are a set of flags which can be combined when either instantiating the class or setting theOptionsproperty before calling theLoad* method. It can be a combination of any of the following flags :
PDFOPT\_BASIC\_LAYOUT : Tries to render the text in the order you can see it with Acrobat Reader. Note that the elements will not be mapped in the output exactly as they appear with Acrobat Reader : elements physically disjoint on the x-axis will be separated by a space by default. The BlockSeparator property can be used to modify this separator. The following text for example :
Company1 Company2 address1 address2 city1 city2
will be rendered as :
Company1 Company2
address1 address2
city1 city2
or, if you set the BlockSeparator property to "#", the output will be :
Company1#Company2
address1#address2
city1#city2
PDFOPT\_NO\_HYPHENATED\_WORDS : When specified, tries to join back hyphenated words into a single word. For example, the following text :
this is a sam- ple text using hyphe- nated words that can split over seve- ral lines.
will be rendered as :
this is a sample
text using hyphenated
words that can split
over several lines.
Associative array containing individual page contents. The array key is the page number, starting from 1.
String to be used when building the Text property to separate individual pages. The default value is a newline.
Current version of the PdfToText class, as a string containing a major, minor and release version numbers. For example : "1.2.19".
The PdfToText class can throw any of the following exceptions :
This is the base class for all exceptions thrown by the PdfToText class.
This exception is thrown if an error occurs when decoding a PDF object. Normally, most of these exceptions are thrown only if debug mode is activated.
Occurs when decryption failed on an encrypted file.
This exception is thrown only if one of the following conditions occur :
Thrown when an error is detected while parsing a template file for retrieving form data, or when retrieving form data.
Extracting form data is fairly simple : use the GetFormData() method and it will return you an object containing all the field values contained in your PDF file, whether they have been filled or not.
You have two ways to retrieve form data :
Both methods return a new object inheriting from the PdfToTextFormData class, which mainly contain helper functions that have no interest for the caller.
The derived class returned by the GetFormData() method has a set of properties that give you access to the form fields contents.
The examples given in the following sections are based on the file "sample.pdf", located in the examples directory "examples/formdata-extraction". It has been taken from a very common form used in the US, located here :
https://www.irs.gov/pub/irs-pdf/fw9.pdf
Gettig form data without a template is very simply ; just execute the following code :
$pdf = new PdfToText ( 'sample.pdf' ) ;
$form_data = $pdf -> GetFormData ( null ) ;
Of course, we should have checked first that form data is present in the PDF file by calling :
if ( $pdf -> HasFormData ( ) )
$form_data = $pdf -> GetFormData ( null ) ;
The object returned in $form\_data has the following definition :
$form_data = (object) PdfFormData
{
protected $f1_1 = (string[6]) "ZZNAME"
protected $f1_2 = (string[14]) "ZZBUSINESSNAME"
protected $c1_1 = (string[1]) "6"
protected $f1_3 = (string[1]) "C"
protected $c1_7 = (string[1]) "7"
protected $f1_4 = (string[7]) "ZZOTHER"
protected $f1_5 = (string[4]) "EX01"
protected $f1_6 = (string[4]) "EX02"
protected $f1_7 = (string[9]) "ZZADDRESS"
protected $f1_8 = (string[6]) "ZZCITY"
protected $f1_9 = (string[28]) "ZZREQUESTERNAME\naddress\ncity"
protected $f1_10 = (string[16]) "ZZACCOUNTNUMBERS"
protected $f1_11 = (string[3]) "123"
protected $f1_12 = (string[2]) "45"
protected $f1_13 = (string[4]) "6789"
protected $f1_14 = (string[2]) "EI"
protected $f1_15 = (string[5]) "ZZEMP"
}
You can open file sample.pdf to verify that the data listed above is conformant to the one contained in the PDF file.
However, you will see that the field names are not very explicit : $f1\_1, $f1\_2, etc. Moreover, information such as social security number is splitted in three parts : $f1\_11, $f1\_12 and $f1\_13.
This is why you may want to spend some time designing a template XML file that maps PDF field names to human-readable ones...
Using an XML template does not require many changes to your existing code ; you just need to supply the path of your XML template when calling the GetFormData() method :
$pdf = new PdfToText ( 'sample.pdf' ) ;
$form_data = $pdf -> GetFormData ( 'sample.xml' ) ;
Using the supplied example, your $form\_data object will look like this :
$form_data = (object) W9
{
const TAXCLASS_INDIVIDUAL = 1 ;
const TAXCLASS_C_CORPORATION = 2 ;
const TAXCLASS_S_CORPORATION = 3 ;
const TAXCLASS_PARTNERSHIP = 4 ;
const TAXCLASS_TRUST_ESTATE = 5 ;
const TAXCLASS_LIMITED_LIABILITY_COMPANY = 6 ;
const TAXCLASS_UNDEFINED = "" ;
const TAXCLASS_OTHER = 7 ;
protected $Name = (string[6]) "ZZNAME"
protected $BusinessName = (string[14]) "ZZBUSINESSNAME"
protected $FederalTaxClassification = (string[1]) "6"
protected $LLCClassification = (string[1]) "C"
protected $OtherFederalTaxClassification = (string[1]) "7"
protected $OtherFederalTaxInfo = (string[7]) "ZZOTHER"
protected $ExemptPayeeCode = (string[4]) "EX01"
protected $FATCAExemptionCode = (string[4]) "EX02"
protected $Address = (string[9]) "ZZADDRESS"
protected $City = (string[6]) "ZZCITY"
protected $RequesterCoordinates = (string[28]) "ZZREQUESTERNAME\naddress\ncity"
protected $AccountNumbers = (string[16]) "ZZACCOUNTNUMBERS"
protected $SSN_1 = (string[3]) "123"
protected $SSN_2 = (string[2]) "45"
protected $SSN_3 = (string[4]) "6789"
protected $EIN_1 = (string[2]) "EI"
protected $EIN_2 = (string[5]) "ZZEMP"
protected $SSN = (string[11]) "123-45-6789"
protected $EIN = (string[8]) "EI-ZZEMP"
}
You can notice some important differences with the templateless version :
All of the above have been defined in the template file, and the parent class, PdfToTextFormData, is able to handle any modifications made to any of the properties involved in a grouped property. For example, if you modify the SSN\_1 property, then the SSN property will be rebuilt accordingly. Similarly, if you modify the SSN property, then the SSN\_1, SSN\_2 and SSN\_3 properties will be changed accordingly.
This internal work is performed by the PdfToTextFormData, from which any class returned by the GetFormData() method inherits.
Now, this is time to have a look at what is a template. This is described in the next section.
Here is the typical form that has been designed to extract form data from our sample PDF file :
<?xml version="1.0" encoding="utf-8" ?>
<forms class="W9">
<form version="Form W-9 (Rev. December 2014)">
<field name="Name" form-field="f1_1" type="string"/>
<field name="BusinessName" form-field="f1_2" type="string"/>
<field name="FederalTaxClassification" form-field="c1_1" type="choice">
<case value="1" constant="TAXCLASS_INDIVIDUAL"/>
<case value="2" constant="TAXCLASS_C_CORPORATION"/>
<case value="3" constant="TAXCLASS_S_CORPORATION"/>
<case value="4" constant="TAXCLASS_PARTNERSHIP"/>
<case value="5" constant="TAXCLASS_TRUST_ESTATE"/>
<case value="6" constant="TAXCLASS_LIMITED_LIABILITY_COMPANY"/>
<default constant="TAXCLASS_UNDEFINED"/>
</field>
<field name="LLCClassification" form-field="f1_3" type="string"/>
<field name="OtherFederalTaxClassification" form-field="c1_7" type="choice">
<case value="7" constant="TAXCLASS_OTHER"/>
<default constant="TAXCLASS_UNDEFINED"/>
</field>
<field name="OtherFederalTaxInfo" form-field="f1_4" type="string"/>
<field name="ExemptPayeeCode" form-field="f1_5" type="string"/>
<field name="FATCAExemptionCode" form-field="f1_6" type="string"/>
<field name="Address" form-field="f1_7" type="string"/>
<field name="City" form-field="f1_8" type="string"/>
<field name="RequesterCoordinates" form-field="f1_9" type="string"/>
<field name="AccountNumbers" form-field="f1_10" type="string"/>
<field name="SSN_1" form-field="f1_11" type="string"/>
<field name="SSN_2" form-field="f1_12" type="string"/>
<field name="SSN_3" form-field="f1_13" type="string"/>
<field name="EIN_1" form-field="f1_14" type="string"/>
<field name="EIN_2" form-field="f1_15" type="string"/>
<group name="SSN" separator="-" fields="SSN_1, SSN_2, SSN_3"/>
<group name="EIN" separator="-" fields="EIN_1, EIN_2"/>
</form>
</forms>
The root tag is specified by <forms>. Subtags are <form> definitions.
Currently, the following are implemented :
Please note that the above information could be subject to changes in future releases.
Form fields can currently be of three types :
A (somewhat) tedious way to get the knowledge of which PDF form field contains which data is :
String fields within a form are basically specified with the following XML field construct :
<field name="SSN_1" form-field="f1_11" type="string"/>
The attributes are the following :
Choice fields are typically used for a group of checkboxes that behave like radiobuttons :
<field name="FederalTaxClassification" form-field="c1_1" type="choice">
<case value="1" constant="TAXCLASS_INDIVIDUAL"/>
<case value="2" constant="TAXCLASS_C_CORPORATION"/>
<case value="3" constant="TAXCLASS_S_CORPORATION"/>
<case value="4" constant="TAXCLASS_PARTNERSHIP"/>
<case value="5" constant="TAXCLASS_TRUST_ESTATE"/>
<case value="6" constant="TAXCLASS_LIMITED_LIABILITY_COMPANY"/>
<default constant="TAXCLASS_UNDEFINED"/>
</field>
They basically contain the same information as string fields, except that the type attribute is set to choice.
They can be paired to a set of constants defined either through the <case> or <default> tags :
Note that each defined constant, such as TAXCLASS\_INDIVIDUAL in the above example, will be defined as a class constant in the object returned by the GetFormData() method.
Grouped fields allow you to create new properties, coming from the concatenation of existing fields. A typical definition looks like this :
<group name="SSN" separator="-" fields="SSN_1, SSN_2, SSN_3"/>
In our template example, we have seen that the social security number (SSN) was splitted into three parts : SSN\_1, SSN\_2 and SSN\_3. The above example creates an SSN property, which is the result of the concatenation of the specified fields.
The required attributes are the following :
Note that modifying the value of a property referenced by a grouped field will modify the corresponding grouped field value. Similarly, modifying the grouped field value will modify its associated properties.
Sometimes, it's easier to tell the PdfToText class which area(s) of text you want to retrieve from which page(s), rather than having to struggle with regular expressions to isolate the information you want. This is especially true when you want to retrieve data from tabular reports.
Captures are a solution for such needs ; they allow you to define shapes of the following types :
Areas to be captured are specified using a capture definition file or string, in XML format.
Capturing areas of your PDF document will require you a few preliminary steps that involve some extra work. This is why you will have to choose between using captures or using regular expressions on the extracted text :
The step-by-step overview below references the files located in the examples/text-captures directory.
A PDF file uses a coordinate system whose values are more or less expressed in "relative units". The point at coordinates (0,0) is located at the bottom-left corner of the page ; the point at coordinates (x,y), where "x" is the page width and "y" the page height, is located at the top-right corner of the page.
That's nice, but how to find the coordinates (x, y, width and height) of the rectangle that contains the text you want to capture ? of course, if you have an Adobe product that allows you to modify PDF files, it may give you such information. But if you don't have such a tool, you're stuck.
The PdfToText class has a flag, PDFOPT\_DEBUG\_SHOW\_COORDINATES, that includes in the text output the (x,y) coordinates, width and height of each block of text found in the PDF stream. This may be the only occasion you will have to use this option. The following example script explains how the file sample-report.txt in the examples/text-capture directory of this package was generated :
<?php
include ( 'path/to/PdfToText.phpclass' ) ;
$pdf = new PdfToText ( 'sample-report.pdf', PDFOPT_DEBUG_SHOW_COORDINATES ) ;
file_put_contents ( 'sample-report.txt', $pdf -> Text ) ;
The first few lines of file sample-report.txt now contain things like this :
[Page : 1, width = 596, height = 843]
[x:248.76, y:760.4, w: 79.895, h:12]REPORT HEADER
[x:70.695, y:746.6, w: 2.381, h:12]
[x:84.495, y:722.6, w: 2.381, h:12]
[x:84.495, y:734.6, w: 2.381, h:12]
[x:0, y:708.08, w: 124.619, h:12]Column1 Column2 Column3
[x:70.8, y:690.32, w: 76.99, h:12]L1C1 L1C2 L1C3
[x:70.8, y:676.04, w: 76.99, h:12]L2C1 L2C2 L2C3
[x:70.8, y:661.76, w: 76.99, h:12]L3C1 L3C2 L3C3
[x:70.8, y:647.48, w: 76.99, h:12]L4C1 L4C2 L4C3
[x:70.8, y:633.2, w: 2.381, h:12]
[Page : 2, width = 596, height = 843]
[x:70.8, y:760.4, w: 2.381, h:12]
[x:0, y:745.88, w: 124.619, h:12]Column1 Column2 Column3
[x:70.8, y:731.72, w: 178.533, h:12]L1C1 (page 2) L1C2 (page 2) L1C3 (page 2)
[x:70.8, y:717.44, w: 178.533, h:12]L2C1 (page 2) L2C2 (page 2) L2C3 (page 2)
[x:70.8, y:703.16, w: 178.533, h:12]L3C1 (page 2) L3C2 (page 2) L3C3 (page 2)
[x:70.8, y:688.88, w: 178.533, h:12]L4C1 (page 2) L4C2 (page 2) L4C3 (page 2)
[x:70.8, y:674.6, w: 2.381, h:12]
instead of the default output with no specific options (see file sample-report.pdf for its graphic counterpart) :
REPORT HEADER
Column1 Column2 Column3
L1C1 L1C2 L1C3
L2C1 L2C2 L2C3
L3C1 L3C2 L3C3
L4C1 L4C2 L4C3
Column1 Column2 Column3
L1C1 (page 2) L1C2 (page 2) L1C3 (page 2)
L2C1 (page 2) L2C2 (page 2) L2C3 (page 2)
L3C1 (page 2) L3C2 (page 2) L3C3 (page 2)
L4C1 (page 2) L4C2 (page 2) L4C3 (page 2)
In the output generated with the PDFOPT\_DEBUG\_SHOW\_COORDINATES option, you will notice some additional information between square brackets :
The first one appears at the start of each page :
[Page : 1, width = 596, height = 843] ... [Page : 2, width = 596, height = 843] ...
It gives the page number, its width and its height in graphics coordinates.
The second kind of information that appears between square brackets gives size information regarding the block of text immediately following it :
[x:248.76, y:760.4, w: 79.895, h:12]REPORT HEADER
The various information added to the output using the PDFOPT\_DEBUG\_SHOW\_COORDINATES option should give you enough data to design the capture definitions file.
An example is given below, which will capture some contents of the PDF file sample-report.pdf ; a more detailed explanation of this XML format is given in the XML Capture Definitions section :
<captures>
<rectangle name="Title">
<page number="1" left="245" right="360" top="760" bottom="740"/>
</rectangle>
<lines name="ReportLines" default="UNDEFINED" separator="\n">
<page number="1" top="690" height="16" bottom="0"/>
<page number="2..$" top="731" height="16" bottom="0"/>
<column name="Column1" left="70" width="600"/>
</lines>
</captures>
Basically, the above data says that it wants to capture two things :
You may have noticed that both of the <rectangle> and <lines> contain subtags named <page> ; they define which pages of the input PDF file are applicable to the parent tag. For example :
All the coordinates, widths and heights listed in this definition have been taken from the information contained in file sample-report.txt, generated at the previous step (the above section).
Captures are processed independently of text extraction ; the only two things you have to do is :
Call either the SetCaptures() or the SetCapturesFromString() method to load the capture definitions, as in the following example :
$pdf = new PdfToText ( 'sample-report.pdf', PdfToText::PDFOPT_CAPTURE ) ; $pdf -> SetCaptures ( 'sample-report.xml' ) ;
Retrieving capture data is fairly simple :
$captures = $pdf -> GetCaptures ( ) ;
This will return an object of type PdfToTextCaptureLines.
Each tag that has been defined in the XML definition file (sample-report.txt in our example) has a name ; for example, our rectangle shape has the name "Title", and the group of lines is named "ReportLines" (each report line having one column named "Column1").
Every capture name specified in the Capture definitions file can be accessed as a property name in the object returned by the GetCaptures() method :
$capture -> Title
$capture -> ReportLines
However, the way these various informations are processed depend on the capture type :
For Rectangle captures, the property will be an array, since the same capture can be defined on different pages at different locations. To retrieve the Title located on the first page of our PDF file, we will have to write :
$capture -> Title [1] -> Text
or :
( string ) $capture -> Title [1]
Rectangle captures are accessible by their page number. There will be a capture for each page of the document, even if not present in the list of applicable pages.
For lines captures, the situation is a little bit different : the interest of capturing report lines (together with their column data) is to be able to process them all at once. This is why they are grouped together in a single collection, that you can display or process with such kind of loops :
foreach ( $captures -> ReportLines as $line ) { foreach ( $line as $column ) // do something with $column -> Text }
Inside a line, you can also reference a column by its name :
$line -> Column1 -> Text
This section describes the format of an XML Capture definition.
The <captures> tag is the top-level node for the XML Capture definitions. It has no specific attributes and can contain one or more of the following subtags :
The <rectangle> tag defines a rectangular area used to capture text enclosed by it. It has the following attributes :
The <rectangle> tag must specify at least one <page> subtag, to indicate the PdfToText class on which pages of the document the captured text should be searched.
The <page> subtag of the <rectangle> tag has the following attributes :
number : Page number(s). This can be a comma-separated list of pages or ranges of pages separated by an ellipsis, as in the following example :
"1,2" "1,2..10"
The special character "$" means "the last page".
Expressions are allowed ; the following for example :
"1, $-9..$"
indicates that the rectangle area applies to page 1 and to the last 10 pages of the document.
The other attributes for the <page> tag are the following :
The <lines> tag allows to define a group of lines to be captured. It defines applicable pages, and provides definitions about each column to be captured. It is mainly used for capturing report lines and has the following attributes :
The <lines> tag can contain as many <page> and <column> subtags as needed.
The <page> subtag of the <lines> tag has more or less the same objective as the <page> subtag of the <rectangle> tag, with a few differences although. It is not used to define the shape of a rectangle, but rather the top and bottom coordinates of the first and last line of a page, along with the line height.
The following attributes are available :
The <column> subtag identifies a column position within a line. It has the following attributes :
Either the width or right attribute must be specified.
The object returned by the GetCaptures() method needs some explanation ; a distinction has been made between what is captured and how to access it.
The GetCaptures() method returns an object hierarchy that answers the how ; every object in this hierarchy contains objects that answer the what. The method returns an object of class PdfToTextCaptures, described below.
This class dynamically creates properties coming directly from the names specified with the <rectangle> and <lines> tags of the Capture XML definitions.
Thus, if we take our example definition file and call the GetCaptures() method :
$captures = $pdf -> GetCaptures ( ) ;
Then we will be able to access the properties defined in our capture file in the following way :
$captures -> Title
or :
$captures -> ReportLines
Here is the correspondance between Capture XML definitions and the properties found in this object :
This class allows to retrieve information about the text captured based on a <rectangle> definition. It behaves like an array containing elements of type PdfToTextCapturedRectangle.
The reason why this class is an array is because the rectangle shape can be used to capture text at different locations on different pages.
This class also behaves as an array that gives elements of type PdfToTextCaptureLine. It contains the lines that have been captured on the whole document.
This is the base abstract class for the various shapes captured in the document (PdfToTextCapturedRectangle, etc.). It has the following properties :
Inherits from the PfToTextCapturedText class. It currently does not add any new properties or methods.
Inherits from the PfToTextCapturedText class. Adds the following properties :
An object of this class can be iterated like an array. It also provides direct access to a PdfToTextCapturedColumn object by specifying its name as a property ; for example :
$captured_line -> Column1
Of course, the column name has been specified in the Capture XML definition, with the name attribute of the <column> tag.
Files (57) |
File | Role | Description | ||
---|---|---|---|---|
CIDTables (6 files) | ||||
contributions (1 file) | ||||
examples (1 file, 6 directories) | ||||
FontMetrics (14 files) | ||||
FormTemplates (1 file) | ||||
images (1 file) | ||||
Maps (2 files) | ||||
CREDITS.md | Data | Auxiliary data | ||
HISTORY.txt | Doc. | Documentation | ||
LICENSE | Lic. | License | ||
NOTICE | Data | Auxiliary data | ||
PdfToText.phpclass | Class | Class source | ||
README.md | Doc. | Documentation |
Files (57) | / | CIDTables |
File | Role | Description |
---|---|---|
IDENTITY-H-ABCDEE.cid | Data | Auxiliary data |
IDENTITY-H-empty.cid | Data | Auxiliary data |
IDENTITY-H-GQJGLM.cid | Data | Auxiliary data |
IDENTITY-H-PISJAS.cid | Data | Auxiliary data |
IDENTITY-H-TSPALS.cid | Data | Auxiliary data |
IDENTITY-H.cid | Data | IDENTITY-H CID table map |
Files (57) | / | examples |
File | Role | Description | ||
---|---|---|---|---|
example01 (4 files) | ||||
example02 (4 files) | ||||
example03 (4 files) | ||||
formdata-extraction (4 files) | ||||
image-extraction (4 files) | ||||
text-capture (5 files) | ||||
README.md | Data | Auxiliary data |
Files (57) | / | examples | / | example01 |
File | Role | Description |
---|---|---|
example.php | Example | Example script |
README.md | Doc. | Documentation |
sample.pdf | Data | Auxiliary data |
sample.txt | Doc. | Documentation |
Files (57) | / | examples | / | example02 |
File | Role | Description |
---|---|---|
example.php | Example | Example script |
README.md | Doc. | Documentation |
sample.pdf | Data | Auxiliary data |
sample.txt | Doc. | Documentation |
Files (57) | / | examples | / | example03 |
File | Role | Description |
---|---|---|
example.php | Example | Example script |
README.md | Data | Auxiliary data |
sample.pdf | Data | Auxiliary data |
sample.txt | Doc. | Documentation |
Files (57) | / | examples | / | formdata-extraction |
File | Role | Description |
---|---|---|
example.php | Example | Example script |
README.md | Doc. | Documentation |
sample.pdf | Data | Auxiliary data |
sample.xml | Data | Auxiliary data |
Files (57) | / | examples | / | image-extraction |
File | Role | Description |
---|---|---|
example.php | Example | Example script |
README.md | Doc. | Documentation |
sample.pdf | Data | Auxiliary data |
sample.txt | Doc. | Documentation |
Files (57) | / | examples | / | text-capture |
File | Role | Description |
---|---|---|
example.php | Example | Example script |
README.md | Doc. | Documentation |
sample-report.pdf | Data | Auxiliary data |
sample-report.txt | Data | Auxiliary data |
sample-report.xml | Data | Auxiliary data |
Files (57) | / | FontMetrics |
File | Role | Description |
---|---|---|
courier.fm | Data | Auxiliary data |
courierb.fm | Data | Auxiliary data |
courierbi.fm | Data | Auxiliary data |
courieri.fm | Data | Auxiliary data |
helvetica.fm | Data | Auxiliary data |
helveticab.fm | Data | Auxiliary data |
helveticabi.fm | Data | Auxiliary data |
helveticai.fm | Data | Auxiliary data |
symbol.fm | Data | Auxiliary data |
times.fm | Data | Auxiliary data |
timesb.fm | Data | Auxiliary data |
timesbi.fm | Data | Auxiliary data |
timesi.fm | Data | Auxiliary data |
zapfdingbats.fm | Data | Auxiliary data |
Files (57) | / | Maps |
File | Role | Description |
---|---|---|
adobe-charsets.map | Data | Auxiliary data |
unicode-to-ansi.map | Data | Auxiliary data |
The PHP Classes site has supported package installation using the Composer tool since 2013, as you may verify by reading this instructions page. |
Install with Composer |
Version Control | Unique User Downloads | Download Rankings | |||||||||||||||
100% |
|
|
User Ratings | User Comments (7) | ||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
Applications that use this package |
If you know an application of this package, send a message to the author to add a link here.