com.itextpdf.text.pdf.parser
Class LocationAwareTextExtractingPdfContentRenderListener

java.lang.Object
  extended by com.itextpdf.text.pdf.parser.LocationAwareTextExtractingPdfContentRenderListener
All Implemented Interfaces:
RenderListener, TextProvidingRenderListener

public class LocationAwareTextExtractingPdfContentRenderListener
extends Object
implements TextProvidingRenderListener

Development preview - this class (like all of the parser classes) is still under heavy development, and both its behavior and its interface are subject to change.
A text extraction renderer that keeps track of the relative position of text on the page. The resulting text will be relatively consistent with the physical layout that most PDF files have on screen.
This renderer keeps track of the orientation of each piece of text and of its distance (both perpendicular and parallel) relative to the unit vector of that orientation. Text is ordered by orientation, then by perpendicular distance, then by parallel distance. Text with the same perpendicular distance but a different parallel distance is separated by tab characters.
If pieces of text are close to each other on the same line (within 4 space widths), they are kept together (separated by a single space).
This renderer also uses a simple strategy based on the font metrics to determine if a blank space should be inserted into the output.
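The ordering and separator rules described above can be illustrated with a minimal, self-contained sketch. It does not use the iText classes: `LayoutOrderSketch`, `Chunk`, and `join` are hypothetical names, the orientation and perpendicular distance are assumed to be already quantized to integers, and the newline between lines is an assumption (the description above only specifies tabs and spaces within a line).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Illustrative sketch only; these types are hypothetical and are not part of iText.
final class LayoutOrderSketch {

    /** Hypothetical stand-in for a located piece of text. */
    static final class Chunk {
        final String text;
        final int orientation;     // quantized direction of the baseline
        final int perpendicular;   // quantized distance across the baseline direction (smaller sorts first)
        final float parallelStart; // start position along the baseline direction
        final float parallelEnd;   // end position along the baseline direction

        Chunk(String text, int orientation, int perpendicular,
              float parallelStart, float parallelEnd) {
            this.text = text;
            this.orientation = orientation;
            this.perpendicular = perpendicular;
            this.parallelStart = parallelStart;
            this.parallelEnd = parallelEnd;
        }
    }

    /** Orders chunks by orientation, perpendicular, then parallel distance, and joins them. */
    static String join(List<Chunk> chunks, float spaceWidth) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator
                .comparingInt((Chunk c) -> c.orientation)
                .thenComparingInt(c -> c.perpendicular)
                .thenComparingDouble(c -> c.parallelStart));

        StringBuilder out = new StringBuilder();
        Chunk prev = null;
        for (Chunk c : sorted) {
            if (prev != null) {
                boolean sameLine = prev.orientation == c.orientation
                        && prev.perpendicular == c.perpendicular;
                if (!sameLine) {
                    out.append('\n'); // assumption: a new line starts a new row of output
                } else {
                    float gap = c.parallelStart - prev.parallelEnd;
                    // Within 4 space widths: single space; farther apart: tab.
                    out.append(gap > 4 * spaceWidth ? '\t' : ' ');
                }
            }
            out.append(c.text);
            prev = c;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Chunk> chunks = Arrays.asList(
                new Chunk("42", 0, 1, 200f, 215f),
                new Chunk("World", 0, 0, 60f, 90f),
                new Chunk("Total", 0, 1, 10f, 40f),
                new Chunk("Hello", 0, 0, 10f, 50f));
        System.out.println(join(chunks, 5f));
    }
}
```

In this sketch the chunks arrive in arbitrary order, as they may in a PDF content stream, and the sort alone restores reading order. The separator is then chosen purely from the gap between neighbours on the same line.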

Since:
5.0.0

Nested Class Summary
private static class LocationAwareTextExtractingPdfContentRenderListener.LocationOnPage
          Represents a chunk of text, its orientation, and its location relative to the orientation vector
 
Field Summary
private  Vector chunkEnd
          the most recent ending point of the current chunk of text
private  Vector chunkStart
          the starting point of the current line of text
private  StringBuffer chunkText
          contains the text accumulated so far for the current chunk
(package private) static boolean DUMP_STATE
          set to true for debugging
(package private)  boolean firstRender
          whether the operation is the first render of the page
private  List<LocationAwareTextExtractingPdfContentRenderListener.LocationOnPage> locationalResult
          a summary of all found text
 
Constructor Summary
LocationAwareTextExtractingPdfContentRenderListener()
          Creates a new text extraction renderer.
 
Method Summary
 void beginTextBlock()
          Called when a new text block is beginning (i.e. BT)
private  void captureChunk(String text)
          Captures the specified text as a single, cohesive chunk of text using the current line start and end information
private  void dumpState()
          Used for debugging only
 void endTextBlock()
          Called when a text block has ended (i.e. ET)
 String getResultantText()
          Returns the result so far.
 void renderImage(ImageRenderInfo renderInfo)
          no-op method - this renderer isn't interested in image events
 void renderText(TextRenderInfo renderInfo)
          Captures text using a relatively advanced algorithm for determining text chunks and spaces
 void reset()
          Resets the internal state
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

DUMP_STATE

static boolean DUMP_STATE
set to true for debugging


chunkStart

private Vector chunkStart
the starting point of the current line of text


chunkEnd

private Vector chunkEnd
the most recent ending point of the current chunk of text


chunkText

private StringBuffer chunkText
contains the text accumulated so far for the current chunk


locationalResult

private List<LocationAwareTextExtractingPdfContentRenderListener.LocationOnPage> locationalResult
a summary of all found text


firstRender

boolean firstRender
whether the operation is the first render of the page

Constructor Detail

LocationAwareTextExtractingPdfContentRenderListener

public LocationAwareTextExtractingPdfContentRenderListener()
Creates a new text extraction renderer.

Method Detail

reset

public void reset()
Resets the internal state

Specified by:
reset in interface RenderListener
See Also:
RenderListener.reset()

beginTextBlock

public void beginTextBlock()
Description copied from interface: RenderListener
Called when a new text block is beginning (i.e. BT)

Specified by:
beginTextBlock in interface RenderListener
See Also:
RenderListener.beginTextBlock()

endTextBlock

public void endTextBlock()
Description copied from interface: RenderListener
Called when a text block has ended (i.e. ET)

Specified by:
endTextBlock in interface RenderListener
See Also:
RenderListener.endTextBlock()

getResultantText

public String getResultantText()
Returns the result so far.

Specified by:
getResultantText in interface TextProvidingRenderListener
Returns:
a String with the resulting text.

dumpState

private void dumpState()
Used for debugging only


renderText

public void renderText(TextRenderInfo renderInfo)
Captures text using a relatively advanced algorithm for determining text chunks and spaces

Specified by:
renderText in interface RenderListener
Parameters:
renderInfo - information specifying the text to render
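This page does not spell out how font metrics decide whether a space is inserted. One plausible reading, sketched below without any iText classes, is to compare the gap between consecutive text fragments with the width of a space in the current font. `SpaceHeuristicSketch`, `Fragment`, and `merge` are hypothetical names, and the threshold of half a space width is an assumption, not a value taken from this documentation.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; not the library's implementation.
final class SpaceHeuristicSketch {

    // Assumed threshold: a gap wider than half a space counts as a word break.
    static final float GAP_FACTOR = 0.5f;

    /** Hypothetical stand-in for one rendered text fragment on a single baseline. */
    static final class Fragment {
        final String text;
        final float start;      // start position along the baseline
        final float end;        // end position along the baseline
        final float spaceWidth; // width of a space character in the fragment's font

        Fragment(String text, float start, float end, float spaceWidth) {
            this.text = text;
            this.start = start;
            this.end = end;
            this.spaceWidth = spaceWidth;
        }
    }

    /** Concatenates fragments in order, inserting a space where the gap is large enough. */
    static String merge(List<Fragment> fragments) {
        StringBuilder sb = new StringBuilder();
        Fragment prev = null;
        for (Fragment f : fragments) {
            if (prev != null) {
                float gap = f.start - prev.end;
                boolean alreadySpaced =
                        (sb.length() > 0 && sb.charAt(sb.length() - 1) == ' ')
                        || f.text.startsWith(" ");
                if (gap > f.spaceWidth * GAP_FACTOR && !alreadySpaced) {
                    sb.append(' ');
                }
            }
            sb.append(f.text);
            prev = f;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Fragment> line = Arrays.asList(
                new Fragment("Hel", 0f, 15f, 5f),
                new Fragment("lo", 15.2f, 25f, 5f),   // tiny gap: same word
                new Fragment("World", 28f, 50f, 5f)); // gap of 3 exceeds 2.5: new word
        System.out.println(merge(line));
    }
}
```

A check of this kind matters because PDF content streams often position words individually without emitting an explicit space character, and may split a single word across several show-text operations.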

captureChunk

private void captureChunk(String text)
Captures the specified text as a single, cohesive chunk of text using the current line start and end information

Parameters:
text - the text to capture as a chunk

renderImage

public void renderImage(ImageRenderInfo renderInfo)
no-op method - this renderer isn't interested in image events

Specified by:
renderImage in interface RenderListener
Parameters:
renderInfo - information specifying what to render
Since:
5.0.1
See Also:
RenderListener.renderImage(com.itextpdf.text.pdf.parser.ImageRenderInfo)
