Infodex PDF Advantage
Infodex PDF Search is unlike the multitude of PDF search applications available. Infodex is a purpose-built Information Retrieval platform that efficiently collects, organizes, analyzes, searches and displays PDF documents. Infodex unlocks and exposes the information trapped inside PDFs. The Infodex Navigator’s built-in PDF display and innovative search features make searching PDFs effortless, fast and accurate! Other PDF applications simply don’t match the capabilities, ease of use, and performance of Infodex Search.
Inside PDF
Portable Document Format (PDF) is a file format used to present and exchange documents reliably, independent of software, hardware, or computing platform. Invented by Adobe, PDF is now an open standard maintained by the International Organization for Standardization (ISO). A PDF can contain text, links, buttons, form fields, audio, video, and business logic. PDFs can be electronically signed and are easily viewed using the free Acrobat Reader DC software, web browsers, etc.
PDF Benefits are HUGE
Even if you or your organization aren’t generating massive amounts of PDF, the world is. Tomes of original and historical work have been migrated to PDF. Why? There are big benefits:
- Preserves original document visual fidelity
- Helps establish a document’s official version and publication date
- Protects the original source document while allowing a facsimile (PDF) to be published.
- Optimized file size for distribution
- Contains a wide variety of content types (text, links, graphics, sounds, workflows, digital signatures)
- Platform independent (Windows, Mac, iOS, Android, etc.)
- Built-in security with optional password protection
- Evolving Standard
- De facto world-wide adoption. PDF isn’t going away!
Not all PDF is Created Equal
How a PDF is visually laid out is up to the designer, but how the PDF file is built internally is up to the tools used to create it. A PDF page is created much as an artist paints on a blank canvas. For example, the color black is used to paint lines, objects and text, followed by other colors and objects. The PDF tool decides how to draw the page, usually in an order that optimizes computer resources rather than document viewing performance. Upon completion, when all page items have been described, the finished “masterpiece” file can be used. The final visual representation may appear correct, but cheap tools typically create poor-quality PDFs. For these and other reasons, some PDFs are poorly optimized in size, object ordering, and layout, which contributes to slow page drawing and document navigation.
OCR (Optical Character Recognition)
The primary aim of PDF is to faithfully recreate the original visual representation of a document. However, the representation itself isn’t directly searchable. A text layer is required to contain the searchable text information. To enable search, two options exist.
First, the preferred approach is to have the source document application (e.g., MS Word, Adobe InDesign, etc.) create the PDF and directly embed the document text onto the PDF page. This guarantees the resulting PDF page text is a direct copy of the original document text. Correspondingly, the searchability of the PDF will be as good as the source.
Second, a popular method of creating a PDF is from a collection of image files or scanned paper documents. However, a PDF containing only images cannot be directly searched. To enable search, an additional processing stage called OCR (Optical Character Recognition) is required. The OCR process examines the images on a page using pattern recognition techniques to identify text. The text is then written back onto the page as hidden, but searchable, text. However, even using the most advanced OCR techniques, the output text is usually not 100% accurate, due to image quality, layout, styling, etc. A validation stage can further improve OCR text by checking spelling and grammar.
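To make the validation stage more concrete, here is a minimal Python sketch of one possible spelling check. It is an illustration only and does not describe how Infodex or any particular OCR product works. The function name correct_ocr_text, the vocabulary argument, and the 0.8 similarity cutoff are all assumptions chosen for the example. It uses fuzzy matching from Python's standard difflib module to replace unrecognized words with the closest word from a known word list.

```python
import difflib
import re


def correct_ocr_text(text, vocabulary, cutoff=0.8):
    """Replace words missing from the vocabulary with their closest match.

    Hypothetical helper for illustration: `vocabulary` is any iterable of
    known-good words, and `cutoff` is the minimum similarity (0 to 1)
    required before a replacement is made.
    """
    vocab = sorted({w.lower() for w in vocabulary})
    vocab_set = set(vocab)

    def fix(match):
        word = match.group(0)
        lower = word.lower()
        if lower in vocab_set:
            return word  # already a known word, leave untouched
        candidates = difflib.get_close_matches(lower, vocab, n=1, cutoff=cutoff)
        if not candidates:
            return word  # nothing similar enough, keep the OCR output
        best = candidates[0]
        # Preserve a leading capital letter from the OCR output.
        return best.capitalize() if word[0].isupper() else best

    # Digits are included in tokens so that "d0cument" is treated as one word.
    return re.sub(r"[A-Za-z0-9]+", fix, text)


if __name__ == "__main__":
    words = ["portable", "document", "format"]
    print(correct_ocr_text("Portable d0cument f0rmat", words))
    # prints: Portable document format
```

A real validation stage would need a much larger dictionary, and likely grammar and context checks, since a small word list can miss valid words or suggest wrong replacements.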
Infodex knows and understands PDF and will help you finally take advantage of your large PDF collections.