core-jgi/fitnesse/FitNesseRoot/FitLibraryWeb/PdfDocument/content.txt

The PdfDocument fixture checks the text within a PDF file.

A PDF does not retain explicit paragraph information.

 * PdfDocument uses some simple heuristics to try to segment the text into paragraphs.
 * However, these heuristics are weak in general.
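
To illustrate the kind of heuristic involved, here is a minimal Python sketch. It is not the algorithm PdfDocument uses (the fixture is written in Java on top of pdfbox). It assumes each extracted line of text comes with a vertical position, and it starts a new paragraph whenever the gap to the previous line exceeds a threshold. The names segment_paragraphs, lines and max_gap are hypothetical.

```python
def segment_paragraphs(lines, max_gap):
    """Group (y, text) lines into paragraphs.

    lines   -- (y, text) tuples ordered top to bottom, y growing down the page
    max_gap -- largest vertical distance still treated as the same paragraph
    (Both parameters are assumptions made for this sketch.)
    """
    paragraphs = []
    current = []
    previous_y = None
    for y, text in lines:
        # A large vertical jump is taken as a paragraph break.
        if previous_y is not None and y - previous_y > max_gap and current:
            paragraphs.append(" ".join(current))
            current = []
        current.append(text.strip())
        previous_y = y
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


# Illustrative data only, not taken from a real PDF.
lines = [
    (100, "Thanks for your"),
    (112, "submission."),
    (140, "Follow Up:"),
    (170, "We will contact you"),
    (182, "in the next few days."),
]
print(segment_paragraphs(lines, max_gap=15))
# ['Thanks for your submission.', 'Follow Up:', 'We will contact you in the next few days.']
```

A fixed threshold like this breaks down as soon as line spacing varies between fonts or columns, which is one reason such heuristics are weak in general.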

Here's a document that discusses the issues in accessing text from PDFs and tuning the paragraph-segmenting heuristics in PdfDocument:

 * http://files/pdf/eg.pdf
#
!2 Example
#
Here's a >RunningExample
#
!2 Commands
#
!3 1. Start, Open, Close
#
 * Start checking PDF:

|''with PDF''|

 * Open a PDF file:

|''open''|Submission.pdf|

 * Finish processing the PDF by closing the file:

|''close PDF file''|
#
!3 2. Pages
#
 * Confirm the number of pages:

|''number of pages''|''is''|2|

 * Select a specific page:

|''select page''|1|

 * Select all pages:

|''select all pages''|
#
!3 3. Checking for text anywhere
#
 * Show the text of the current page(s):

|'''show'''|''text''|

 * Check that a string appears somewhere in the text in the current page(s):

|''text''|''contains''|Thanks for your submission|

 * Check that a regular expression matches somewhere in the text in the current page(s):

|''text''|''matches''|Thanks for yo.* submission|
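
The difference between the two checks above can be sketched in Python as a plain substring test versus a regular-expression search. This assumes, as the wording above suggests, that the pattern may match anywhere in the text rather than having to cover all of it. The helper names are hypothetical, and the fixture itself runs on Java, whose regular-expression dialect differs from Python's in some details.

```python
import re


def text_contains(text, expected):
    # Plain substring test: regex metacharacters have no special meaning.
    return expected in text


def text_matches(text, pattern):
    # Regex search: the pattern may match anywhere in the text (assumption).
    return re.search(pattern, text) is not None


# Illustrative page text, not taken from a real PDF.
page_text = "Dear author. Thanks for your submission to the workshop."

print(text_contains(page_text, "Thanks for your submission"))  # True
print(text_matches(page_text, "Thanks for yo.* submission"))   # True
print(text_contains(page_text, "Thanks for yo.* submission"))  # False
```

The last call shows why the two checks are separate: a pattern used as a literal string only matches if the text really contains the characters ".*".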
#
!3 4. Paragraphs
#
 * Select the text below a given heading and up to the next heading (can also use '''contains''' and '''matches'''):

|''paragraph below heading''|Follow Up:|''is''|We will contact you in the next few days to provide feedback on your submission.|

 * Select the paragraph after the one containing the given text (can also use '''is''' and '''matches'''):

|''paragraph after containing''|Conclusions|''contains''|There's no more to say on this topic.|

 * Check a range of paragraphs:

|paragraphs from|0|to|3|
|Extracting Paragraphs from PDF Files|
|PDF Format|
|PDFs do not retain paragraph, heading, etc information. Instead, a PDF encodes a rendered form of a document, rather like the individual characters that are rendered on a screen. A PDF file can be thought of as containing a sequence of pieces of information. Each piece of text is located at a particular (x,y,z) position, along with font information.|
|Depending on the application writing the PDF file, a word may be added as a single word, or it may be added as several substrings. Some applications tend to add each of the characters of a word separately.|
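
Once the text has been split into paragraphs, the selections above amount to simple list operations. The Python sketch below is only an illustration under stated assumptions: indices are zero-based and both ends of the range are inclusive (which the four expected rows for 0 to 3 suggest, but this page does not state), and "after containing" returns the paragraph following the first one that contains the given text. The function names are hypothetical and are not part of the fixture.

```python
def paragraphs_from_to(paragraphs, first, last):
    # Assumption: zero-based indices, both ends inclusive.
    return paragraphs[first:last + 1]


def paragraph_after_containing(paragraphs, text):
    # Assumption: the first paragraph containing the text wins.
    # The last paragraph is skipped because nothing follows it.
    for index, paragraph in enumerate(paragraphs[:-1]):
        if text in paragraph:
            return paragraphs[index + 1]
    return None


# Illustrative paragraphs, not taken from a real PDF.
paragraphs = [
    "Introduction",
    "This note is short.",
    "Conclusions",
    "There's no more to say on this topic.",
    "Appendix",
]

print(paragraphs_from_to(paragraphs, 0, 3))
print(paragraph_after_containing(paragraphs, "Conclusions"))
```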
#
!3 5. Dump Image
#
 * Dump an image of the PDF and include it in the storytest report:

|''show pdf as image''|

This only works with some PDFs. For example, the PDF provided in the ^RunningExample doesn't display the characters.
#
!2 Implementation
#
The fixture uses the open-source Apache ''pdfbox'' library.

See http://incubator.apache.org/pdfbox/ for further details.