AAA Copy and Document Management is now bringing a new solution to storage problems: confidential document scanning and conversion.
What is a Searchable PDF?
A searchable PDF is a PDF file containing text that can be found using the standard Adobe Reader “search” function. The text can also be selected and copied from the PDF. PDF files created from Microsoft Office Word and similar documents are generally searchable by nature, because the source document contains text that is replicated in the PDF. When a PDF is created from a scanned document, however, an OCR (optical character recognition) process must be applied to recognize the characters within the image.
Inside a Searchable PDF
In the context of document imaging, a searchable PDF typically contains both the original scanned image and a separate text layer produced by an OCR process. The text layer is defined in the PDF file as invisible, but it can still be selected and searched.
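As a rough illustration of what "invisible" means here, the PDF format has a text rendering mode operator (`Tr`), and mode 3 draws text with neither fill nor stroke. The sketch below only builds the content-stream fragment for one piece of text; it does not write a complete PDF. The function names and the font resource name `F1` are assumptions made for the example.

```python
def escape_pdf_string(text):
    """Escape backslashes and parentheses for a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def invisible_text_ops(text, x, y, font_size=12, font_resource="F1"):
    """Return content-stream operators that place `text` invisibly at (x, y).

    `font_resource` is a hypothetical name; a real page would have to
    declare a matching font in its resource dictionary.
    """
    return "\n".join([
        "BT",                                   # begin text object
        f"/{font_resource} {font_size} Tf",     # select font and size
        "3 Tr",                                 # rendering mode 3: invisible
        f"{x} {y} Td",                          # move to the text position
        f"({escape_pdf_string(text)}) Tj",      # show the (escaped) string
        "ET",                                   # end text object
    ])


ops = invisible_text_ops("Invoice (copy)", 72, 720)
print(ops)
```

In a searchable PDF produced from a scan, text like this would be positioned over the matching words of the page image, so that selecting or searching the text appears to act on the image itself.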
OCR Accuracy
A number of factors affect the accuracy of the text produced by the OCR process. 100% accuracy is certainly possible under good conditions, but each of the following issues and OCR processing options will have an impact.
Original Image Quality
Although some pre-processing options such as despeckle and deskew can help in some cases, the visual quality of the original scan is of paramount importance.
Image DPI and Format
The image resolution should be at least 150 DPI for OCR processing, and preferably 300 DPI for optimal results, although 200 DPI is often sufficient for good-quality scans. Lossless formats (TIFF Group 4, LZW, etc.) are preferred over lossy formats such as JPEG.
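The resolution guidance can be turned into a quick check. The sketch below derives the effective DPI from an image's pixel width and the physical width of the page, then compares it with the 150 and 300 DPI figures mentioned above. The function names and the labels it returns are illustrative assumptions, not part of any OCR product.

```python
def effective_dpi(pixel_width, page_width_inches):
    """Dots per inch implied by the scan width and the paper width."""
    return pixel_width / page_width_inches


def ocr_readiness(dpi):
    """Classify a resolution against the 150 / 300 DPI guidance."""
    if dpi < 150:
        return "too low"
    if dpi < 300:
        return "acceptable"   # 200 DPI is often enough for clean scans
    return "optimal"


# A letter-size page (8.5 inches wide) scanned 2550 pixels wide.
dpi = effective_dpi(2550, 8.5)
print(dpi, ocr_readiness(dpi))
```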
Despeckle
This pre-processing option removes isolated “dots” within the image, which can cause recognition problems, and makes the resulting image “cleaner”.
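A minimal sketch of the idea, assuming a black-and-white image stored as rows of 0 (white) and 1 (black): any black pixel with no black pixel among its eight neighbours is treated as a speck and cleared. Real despeckle filters usually work on larger specks and are tunable; this single-pixel rule and the name `despeckle` are assumptions for illustration only.

```python
def despeckle(image):
    """Return a copy of `image` with isolated single black pixels removed."""
    height = len(image)
    width = len(image[0]) if height else 0
    cleaned = [row[:] for row in image]
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1:
                continue
            # Look at the (up to) eight surrounding pixels of the original.
            has_neighbour = any(
                image[ny][nx] == 1
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            )
            if not has_neighbour:
                cleaned[y][x] = 0
    return cleaned


page = [
    [1, 0, 0, 0, 0],   # lone dot in the corner: a speck
    [0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],   # two touching pixels: part of a stroke
    [0, 0, 0, 0, 0],
]
for row in despeckle(page):
    print(row)
```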
Deskew
This option can improve OCR results by straightening crooked pages.
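One common way to find how crooked a page is uses a projection profile: when text lines are level, the black pixels pile up in a few rows, so the row counts are very "peaky". The sketch below tries a handful of candidate slopes and keeps the one that gives the peakiest profile. It approximates a small rotation by a vertical shear, and the function names, the candidate list, and the synthetic test page are all assumptions made for the example.

```python
def profile_score(image, shear):
    """Sum of squared row counts after undoing a slope of `shear` rows per column."""
    counts = {}
    for y, row in enumerate(image):
        for x, pixel in enumerate(row):
            if pixel == 1:
                target = round(y - x * shear)   # row this pixel would move to
                counts[target] = counts.get(target, 0) + 1
    return sum(c * c for c in counts.values())


def estimate_shear(image, candidates):
    """Pick the candidate slope that makes the text lines most level."""
    return max(candidates, key=lambda s: profile_score(image, s))


def make_skewed_page(width=100, height=80, shear=0.1, baselines=(10, 30, 50)):
    """Synthetic page: three one-pixel 'text lines' drawn with a known slope."""
    image = [[0] * width for _ in range(height)]
    for base in baselines:
        for x in range(width):
            image[base + round(x * shear)][x] = 1
    return image


page = make_skewed_page(shear=0.1)
print(estimate_shear(page, [-0.2, -0.1, 0.0, 0.1, 0.2]))
```

Once the slope is known, the page can be straightened by shifting each column (or rotating the image) by the opposite amount before OCR. Production tools typically search a much finer range of angles.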
Auto-Rotate
OCR processing usually recognizes text written left-to-right in lines that run top-to-bottom, so pages that are oriented any other way (usually landscape pages) need to be re-oriented to enable recognition.
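The re-orientation step can be sketched as "try each quarter turn and keep the best one". In the example below the image is a grid of 0/1 pixels, and the default scoring function simply rewards horizontal runs of black pixels. That toy score is an assumption: it cannot tell an upright page from an upside-down one, which is why real OCR engines typically compare recognition confidence instead. All names here are hypothetical.

```python
def rotate90(image):
    """Rotate a row-major pixel grid 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def row_profile_score(image):
    """Toy score: higher when black pixels are concentrated in few rows."""
    return sum(sum(row) ** 2 for row in image)


def best_rotation(image, score=row_profile_score):
    """Return (angle, rotated_image) for the quarter turn with the highest score."""
    best_angle, best_image, best_score = 0, image, score(image)
    candidate = image
    for angle in (90, 180, 270):
        candidate = rotate90(candidate)
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best_angle, best_image, best_score = angle, candidate, candidate_score
    return best_angle, best_image


# A "text line" running down the page instead of across it.
sideways = [[0, 1, 0, 0] for _ in range(4)]
angle, upright = best_rotation(sideways)
print(angle)
for row in upright:
    print(row)
```

Passing a different `score` function, for example one based on the confidence reported by an OCR engine, keeps the same search loop while resolving the 0 versus 180 degree ambiguity.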
Language Settings
The language setting determines the set of characters that will be recognized, and the dictionary that will be used as a guide.
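To show how a dictionary can act as a guide, the sketch below takes a raw OCR reading and, if it is not a known word, snaps it to the closest dictionary entry using Python's standard `difflib` module. This is only an illustration of the principle: actual OCR engines apply the dictionary during recognition rather than afterwards, and the function name, the tiny word list, and the 0.6 similarity cutoff are assumptions.

```python
import difflib


def guide_with_dictionary(word, dictionary, cutoff=0.6):
    """Return `word` if known, else the closest dictionary entry, else `word` unchanged."""
    if word in dictionary:
        return word
    matches = difflib.get_close_matches(word, dictionary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


english = ["modern", "scanner", "document"]

# "rn" is a classic misreading of "m" in scanned text.
print(guide_with_dictionary("rnodern", english))
```

Choosing the wrong language would swap in a word list (and character set) that does not match the page, so correct readings could be "corrected" into wrong ones.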
