core-jgi/fitnesse/FitNesseRoot/FitLibraryWeb/PdfDocument/content.txt

91 lines
2.7 KiB
Plaintext

The PdfDocument fixture allows for the text within a PDF file to be checked.
Paragraph information is no longer explicit in a PDF.
* PdfDocument uses some simple heuristics to try and segment the text into paragraphs.
* However, it is weak at doing this in general.
Here's a document that discusses the issues in accessing text from PDFs and tuning the paragraph-segmenting heuristics in PdfDocument:
* http://files/pdf/eg.pdf.
#
!2 Example
#
Here's a >RunningExample
#
!2 Commands
#
!3 1. Start, Open, Close
#
* Start checking PDF:
|''with PDF''|
* Open a PDF file:
|''open''|Submission.pdf|
* Finish processing the pdf by closing the file:
|''close PDF file''|
#
!3 2. Pages
#
* Confirm the number of pages:
|''number of pages''|''is''|2|
* Select a specific page:
|''select page''|1|
* Select all pages
|''select all pages''|
#
!3 3. Checking for text anywhere
#
* Show the text of the current page(s):
|'''show'''|''text''|
* Check that a string appears somewhere in the text in the current page(s):
|''text''|''contains''|Thanks for your submission|
* Check that the regular expression appears somewhere in the text in the current page(s):
|''text''|''matches''|Thanks for yo.* submission|
#
!3 4. Paragraphs
#
* Select the text below a given heading and up to the next heading (can also use '''contains''' and '''matches''':
|''paragraph below heading''|Follow Up:|''is''|We will contact you in the next few days to provide feedback on your submission.|
* Select the text below a given heading and up to the next heading (can also use '''contains''' and '''matches''':
|''paragraph after containing''|Conclusions|''contains''|There's no more to say on this topic.|
* Check a range of paragraphs:
|paragraphs from|0|to|3|
|Extracting Paragraphs from PDF Files|
|PDF Format|
|PDFs do not retain paragraph, heading, etc information. Instead, a PDF encodes a rendered form of a document, rather like the individual characters that are rendered on a screen. A PDF file can be thought of as containing a sequence of pieces of information. Each piece of text is located at a particular (x,y,z) position, along with font information.|
|Depending on the application writing the PDF file, a word may be added as a single word, or it may be added as several substrings. Some applications tend to add each of the characters of a word separately.|
#
!3 5. Dump Image
#
* Dump an image of the PDF and include it in the storytest report:
|''show pdf as image''|
This only works with some PDFs. For example, the PDF provided in the ^RunningExample doesn't display the characters.
#
!2 Implementation
#
This uses the apache open-source ''pdfbox'' system.
See http://incubator.apache.org/pdfbox/ for further details.