PDE
PDE (Pdf Data Extractor) extracts information and tables from PDF (Portable Document Format) files, optionally based on search words, and visualizes the results. Both tasks are handled through a convenient user interface.
Description
PDE is an R package that easily extracts information and tables from PDF
files. PDE_analyzer_i() performs the sentence and table extraction,
while the included PDE_reader_i() provides user-friendly
visualization and quick processing of the obtained results.
Installation
Install the dependent packages
install.packages("tcltk2") # Install the dependent package tcltk2
The package requires the Xpdf command line tools by Glyph & Cog, LLC. Please download the Xpdf command line tools 4.02 from https://github.com/erikstricker/PDE/tree/master/inst/examples/bin and install them on your local disk. Alternatively, the following commands can be used to install and verify the correct Xpdf command line tools:
PDE_install_Xpdftools4.02() # Download and install the Xpdf command line tools
PDE_check_Xpdf_install() # Check if all required XPDF tools are installed correctly
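As a rough pre-check before installing anything, base R can report whether Xpdf tools are already visible on the system PATH. This is only a sketch: the tool names `pdftotext`, `pdftohtml` and `pdftopng` are assumptions made for illustration, and PDE may locate its tools elsewhere than the PATH, so PDE_check_Xpdf_install() remains the authoritative check.

```r
# Tool names are an assumption for this sketch, not taken from the PDE docs.
xpdf_tools <- c("pdftotext", "pdftohtml", "pdftopng")

# Sys.which() returns the full path of each executable found on the PATH,
# or an empty string when the executable is not found.
found <- Sys.which(xpdf_tools)
missing_tools <- names(found)[found == ""]

if (length(missing_tools) > 0) {
  message("Not found on PATH: ", paste(missing_tools, collapse = ", "))
} else {
  message("All listed Xpdf tools are visible on the PATH.")
}
```

A tool reported as missing here is not necessarily missing for PDE; the result only describes the current PATH.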
Install the package through CRAN
install.packages("PDE", dependencies = TRUE)
or choose the location where you downloaded the latest PDE_*.*.*.tar.gz and install it from a local path.
filename <- file.choose() # Choose the location where you downloaded the latest PDE_*.*.*.tar.gz
install.packages(filename, type="source", repos=NULL)
NOTE: The PDE package was tested on Microsoft Windows, Mac and Linux machines. Major differences include the visual appearance of the interfaces and the directory structures, but all functions are preserved.
Execution
The PDE analyzer can be accessed through different functions which are outlined below.
PDE_analyzer()
PDE_analyzer_i()
PDE_extr_data_from_pdfs()
PDE_pdfs2table()
PDE_pdfs2table_searchandfilter()
PDE_pdfs2txt_searchandfilter()
The PDE reader is only available as an interactive user interface requiring the R package tcltk2.
PDE_reader_i()
NOTE: If an error occurs when starting
PDE_analyzer_i() or PDE_reader_i() on Mac, see
Troubleshoot - Error when starting interactive user
interface on Mac (failed to allocate tcl
font).
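Because the interactive interfaces depend on tcltk2, a script can test the prerequisites before trying to open a window. The following is a minimal sketch using base R checks only; the helper name `can_open_pde_ui` is hypothetical and not part of PDE.

```r
# Hypothetical helper: TRUE only when an interactive Tcl/Tk interface can
# plausibly be opened in the current R session.
can_open_pde_ui <- function() {
  interactive() &&                                   # a user is present
    isTRUE(capabilities("tcltk")[[1]]) &&            # R was built with Tcl/Tk
    requireNamespace("tcltk2", quietly = TRUE) &&    # tcltk2 is installed
    requireNamespace("PDE", quietly = TRUE)          # PDE is installed
}

if (can_open_pde_ui()) {
  PDE::PDE_reader_i()
} else {
  message("Interactive interface unavailable in this session; ",
          "the non-interactive PDE functions may still be usable.")
}
```

This check does not detect the Mac font error mentioned in the note above; it only covers missing packages and non-interactive sessions.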
Quick guide to get started
PDE_analyzer_i()
- Run
library("PDE")
PDE_analyzer_i()
<img src="vignettes/scrnshots/Screenshot_PDE_analyzer_user_interface_empty_mac.png" width="50%" style="display: block; margin: auto;" />
<center>
`PDE_analyzer_i()` user interface on Mac
</center>
</br>
- This should open a user interface.
- Fill out the form from top to bottom (standard parameters are preselected).
- The filled form can and should be saved as a TSV file at any time.
This can be done by clicking the Save form as tsv button at the
top center of the form.
NOTE: Choose an empty folder or create a new one as the output directory, since analyses create at least a number of files equal to the number of PDF files analyzed.
PDE_reader_i()
<img src="vignettes/scrnshots/Screenshot_PDE_reader_user_interface_empty_linux.png" width="50%" style="display: block; margin: auto;" />
<center>
`PDE_reader_i()` user interface on Linux
</center>
</br>
- Run
library("PDE")
PDE_reader_i()
- This should open a user interface.
- Load either a sentence analysis file or a folder with such files.
NOTE: Analysis files refer to the files created by the PDE_analyzer_i() which contain “txt+-” in their name.
- The user can browse through all analysis files in the folder to get an overview of the data.
- Additional functions can be enabled by loading the PDF folder as well as the TSV file used for analysis.
NOTE: Flagging and marking change file names but can be reversed in the program at any time.
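To see which files the reader would treat as analysis files, the “txt+-” naming rule can be applied with base R. This is a sketch under assumptions: the helper name `find_analysis_files` and the folder path are hypothetical, and searching subfolders recursively is a choice made here rather than documented reader behavior.

```r
# Hypothetical helper: return files whose name contains the literal
# marker "txt+-" used for sentence analysis files.
find_analysis_files <- function(folder) {
  files <- list.files(folder, recursive = TRUE, full.names = TRUE)
  # fixed = TRUE matches "+" and "-" literally rather than as regex syntax
  files[grepl("txt+-", basename(files), fixed = TRUE)]
}

analysis_folder <- "path/to/output"  # hypothetical output folder
analysis_files <- find_analysis_files(analysis_folder)
message(length(analysis_files), " analysis file(s) found")
```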
Parameters
PDE_analyzer_i()
NOTE: Arguments for the R function PDE_extr_data_from_pdfs() are
listed below each description: argument
- Run
library("PDE")
PDE_analyzer_i()
Choose the locations for the required files:
<img src="vignettes/scrnshots/Screenshot_PDE_analyzer_user_interface.choose_variables_empty.png" width="100%" style="display: block; margin: auto;" />
<center>
`PDE_analyzer_i()` user interface - Choose the locations for the required files
</center>
</br>
- Load form from tsv OR Save form as tsv: The filled form can and should be saved as a TSV file at any time; the saved parameters can later be reloaded from the saved TSV file.
- Reset form: This will clear all fields and variables.
Input/Output:
<img src="vignettes/scrnshots/Screenshot_PDE_analyzer_user_interface_empty_win.png" width="100%" style="display: block; margin: auto;" />
<center>
`PDE_analyzer_i()` user interface - Input/Output
</center>
</br>
- Select PDF folder: Open a folder with PDF files you want to analyze. All PDF files in the folder and its subfolders will be analyzed.
or
Load PDF files: Select one or more PDF files you want to analyze (use Ctrl and/or Shift to select multiple). Multiple PDF files will be separated by ; without a space.
Argument for `PDE_extr_data_from_pdfs()`: `pdfs`
- Select output folder: All analysis files will be created inside this folder. Choose an empty folder or create a new one as the output directory, since analyses create at least a number of files equal to the number of PDF files analyzed. If no output folder is chosen, the results will be saved in the R working directory.
Argument for `PDE_extr_data_from_pdfs()`: `out`
or
Open output folder: Click here to look at the output files or, more generally, the contents of the output folder. The output folder will open in the standard file explorer.
- Choose the output format: The resulting analysis files can be generated either as comma-separated values files (.csv) or as tab-separated values files (.tsv). The former are easier to open and save in Microsoft Excel, while the latter lead to fewer errors when opened in Microsoft Excel (as tabs are rare in texts). Depending on the operating system on which the output files will be opened, it is recommended to choose the Microsoft Windows (WINDOWS-1252), Mac (macintosh) or Linux (UTF-8) encoding.
Argument for `PDE_extr_data_from_pdfs()`: `out.table.format`
- Adjust options in the tabs above: For available options see below.
- Start analysis: Pressing the “Start analysis” button starts processing through `PDE_analyzer_i()`, and the button changes to “Pause analysis”. Pausing generally takes effect only after the files currently being processed are finished. While the analysis is paused, the button changes to “Resume analysis”. The analysis can be aborted at any time by pressing “Stop analysis”. In addition to the analysis output files in the folders, a summary file titled `PDE_analyzer_word_stats.csv` will be generated with search word and filter word statistics.
- Close session: This button closes `PDE_analyzer_i()`. While an analysis is running, the button carries the caption “Stop analysis” and aborts the processing when pressed.
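The Input/Output choices above map onto a non-interactive call roughly as sketched below. Treat this as an illustration under assumptions rather than the package's reference usage: the helper `collect_pdfs` and both folder paths are hypothetical, the `search.words` argument and the values `"txt"` and `".tsv (UTF-8)"` are recalled from the package documentation and should be verified with `?PDE_extr_data_from_pdfs`, and the call is skipped unless PDFs are found and PDE is installed.

```r
# Hypothetical helper: collect every PDF in a folder and its subfolders,
# mirroring the "Select PDF folder" behavior.
collect_pdfs <- function(folder) {
  list.files(folder, pattern = "\\.pdf$", ignore.case = TRUE,
             recursive = TRUE, full.names = TRUE)
}

pdf_folder <- "path/to/pdfs"      # hypothetical input folder
out_folder <- "path/to/output"    # hypothetical, ideally empty, output folder

pdfs <- collect_pdfs(pdf_folder)

if (length(pdfs) > 0 && requireNamespace("PDE", quietly = TRUE)) {
  dir.create(out_folder, recursive = TRUE, showWarnings = FALSE)
  PDE::PDE_extr_data_from_pdfs(
    pdfs = pdfs,                       # files to analyze
    whattoextr = "txt",                # assumed value: sentence extraction only
    out = out_folder,                  # where the analysis files are written
    search.words = c("example"),       # assumed argument name, placeholder word
    out.table.format = ".tsv (UTF-8)"  # assumed spelling of the format string
  )
}
```

Collecting the files first makes it easy to confirm how many PDFs will be analyzed, and therefore roughly how many output files to expect, before starting a long run.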
Search Words:
<img src="vignettes/scrnshots/Screenshot_PDE_analyzer_user_interface.search_words_empty.png" width="100%" style="display: block; margin: auto;" />
<center>
`PDE_analyzer_i()` user interface - Search Words
</center>
</br>
- Choose what to extract: The PDE analyzer has two main functions, A] PDF2TXT (extract sentences from PDF) and B] PDF2TABLE (PDF tables to Excel file), which can be combined or executed separately. Each function can be combined with filters and search words. A file with the sentences carrying the search words will have the name format: [search words]txt+-[context][PDF file name] in the corresponding subfolder. Tables will be named: [PDF file name][number of table][table heading].
Argument for `PDE_extr_data_from_pdfs()`: `whattoextr`
- Search words?: The algorithm can extract sentences, tables, or sentences and tables with one of the search words present. If the “tables” only analysis was chosen, the algorithm can also extract all tables detected in the paper (choose this option here). In the latter case, the search words field should remain empty.
- Save table by category: If search word categories are added and table extracti
