PDFTextStream

PDFTextStream is a PDF text and metadata extraction library available for Java, Python, and .NET. It supports all versions of the PDF document specification (including v1.7, used by Acrobat 8 and 9), extraction of text encoded using double-byte character sets (including Chinese, Japanese, and Korean), decryption of 40-bit and 128-bit encrypted documents, and extraction of all document metadata provided by PDF documents (including form data, bookmarks, and annotations). It also includes easy integration with Jakarta Lucene and the ability to update interactive forms.

Tags: Information Management, Document Repositories, Internet, Web, Indexing/Search, Software Development, Libraries, Java Libraries, PDF, Java, .NET
Operating Systems: Unix, Windows, Cygwin, Mac OS X, POSIX, Solaris, Linux, HP-UX, BSD, OpenBSD, NetBSD, FreeBSD, BSD/OS, OS Independent
Implementation: Java, .NET

Recent releases

  • 23 Apr 2009 14:42

    Changes: An .isStruckThrough() method was added to com.snowtide.pdf.TextUnit, indicating whether a character has a strikethrough drawn through it. Support for embedded character mappings was improved. The calculation of whitespace between words was fixed to properly account for whitespace that is explicitly encoded in the source PDF documents. Handling of composite content encodings was improved; it could previously fail, causing some ranges of PDF content to be ignored during extraction.

    • 30 Dec 2008 19:16

    Changes: This release adds support for extracting XFA forms data as XML and for PDF documents larger than 2GB. It significantly improves the performance of text extraction using VisualOutputTarget. It fixes a bug where the encodings from embedded Type1 fonts were not applied properly in some circumstances, a problem where newer content in updated PDF documents was sometimes ignored, and a problem where PDFDocEncoding-encoded bookmarks and metadata were not decoded properly. A .getDestinationName() method was added to com.snowtide.pdf.Bookmark.

    • 05 Apr 2007 08:59

    Changes: Support was added for updating text, checkbox, radio button, and choice interactive form fields. Support was added for Kodak print job data extraction (%KDK commands) via com.snowtide.pdf.util.KodakPrintData. The AcroFormField.isReadOnly() function was exposed. ByteBuffer-based buildPDFDocument() functions were added to com.snowtide.pdf.lucene.PDFDocumentFactory. The documentation was improved significantly.

    • 28 Mar 2007 10:36

    Changes: This release fixes handling of text spacing that was causing some columnar text to overrun column boundaries. It fixes a problem where text from adjacent lines would be inappropriately intermingled. Unlicensed functionality has been changed so that evaluation use does not require a special evaluation license file: PDFTextStream now randomizes some digits in text extracts when it is operating unlicensed, and the 8-page extract limitation has been removed.

    • 07 Dec 2006 15:12

    Changes: This release adds a com.snowtide.pdf.RegionOutputTarget to support region-specific content extraction. It adds the ability to derive encoding and spatial metrics of Type3 fonts. It adds a pdfts.type3.derive system property to disable derivation if necessary. A problem with com.snowtide.pdf.VisualOutputTarget, where lines would sometimes be inappropriately combined, has been fixed.
