
Re: proofing searchable pdf files



On 10/30/2014 05:47 PM, Gary Roach wrote:
Hi all,

Problem:
I am working on an archiving project and want to archive documents as searchable PDF files, but I can't figure out how to proofread and correct the text overlay. Any suggestions?


Tesseract seems to do a really great job, but I have no good way of verifying this or of correcting any mistakes. Some of the documents are 100 years old and may not be in great shape. I can always retype everything but would like to avoid this as much as possible, for obvious reasons.

Gary R.


OK, more detail.

First, searchable PDF files are two-layer files, with the PDF vector graphics layer overlaying a text file. I have tried GIMP but have not been able to separate the layers. Tesseract will show the text file, but in box format. This seems to be Tesseract's native file structure (I'm guessing) and is virtually unusable for proofreading. I have been able to use Dolphin and Okular to get rid of the boxes, but Okular just replaces them with long strings of dots, which is also unusable for proofreading.

Importing the PDF file into LibreOffice Writer produces garbage.

This is part of a medium-sized, low-budget archiving project that will process several thousand documents, all done by low-tech volunteers. So I really need methods that are straightforward or can be automated to the idiot level. What is needed is a method that will split the vector graphics and text files apart, allow editing of the text file, and then reassemble the file. I am having trouble believing that there isn't software out there that will do this, but I have not been able to find it.
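
To make the kind of workflow I am after concrete, here is a rough sketch, not a tested solution. It assumes Tesseract is asked for hOCR output (for example with "tesseract page.png page hocr") and that each word in that file carries a bounding box and an x_wconf confidence value in its title attribute. The class name HocrWords, the function low_confidence_words, the SAMPLE text, and the threshold of 80 are all made up for illustration. The idea is to list only the doubtful words so a volunteer does not have to reread every page.

```python
import re
from html.parser import HTMLParser

# Patterns for the "title" attribute of an hOCR word span, e.g.
# title='bbox 36 92 96 116; x_wconf 95'
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")
CONF_RE = re.compile(r"x_wconf (\d+)")


class HocrWords(HTMLParser):
    """Collect text, bounding box and confidence of every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._title = None   # title attribute of the word span being read
        self._text = []      # text fragments seen inside that span
        self._depth = 0      # nesting depth of tags inside the word span

    def handle_starttag(self, tag, attrs):
        if self._title is not None:
            # A tag such as <strong> nested inside the current word.
            self._depth += 1
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            self._title = attrs.get("title") or ""
            self._text = []
            self._depth = 0

    def handle_data(self, data):
        if self._title is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._title is None:
            return
        if self._depth > 0:
            self._depth -= 1
            return
        bbox = BBOX_RE.search(self._title)
        conf = CONF_RE.search(self._title)
        self.words.append({
            "text": "".join(self._text).strip(),
            "bbox": tuple(int(n) for n in bbox.groups()) if bbox else None,
            "conf": int(conf.group(1)) if conf else None,
        })
        self._title = None


def low_confidence_words(hocr_text, threshold=80):
    """Return words whose confidence is missing or below the threshold."""
    parser = HocrWords()
    parser.feed(hocr_text)
    parser.close()
    return [w for w in parser.words
            if w["conf"] is None or w["conf"] < threshold]


# Made-up hOCR fragment, standing in for a real Tesseract output file.
SAMPLE = """
<span class='ocr_line' title='bbox 30 90 400 120'>
<span class='ocrx_word' id='word_1_1' title='bbox 36 92 96 116; x_wconf 95'>Deed</span>
<span class='ocrx_word' id='word_1_2' title='bbox 104 92 150 116; x_wconf 41'>0f</span>
<span class='ocrx_word' id='word_1_3' title='bbox 158 92 230 116; x_wconf 88'><strong>Sale</strong></span>
</span>
"""

if __name__ == "__main__":
    for word in low_confidence_words(SAMPLE):
        print(word["text"], word["conf"], word["bbox"])
```

Since hOCR is ordinary HTML, the flagged words could in principle be corrected in any text editor. Rebuilding a searchable PDF from the corrected file is a separate step that this sketch does not cover.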

Your comments so far have pointed me in several different directions but I still haven't found an efficient (or even viable) editing method.

Your help is really appreciated.

Gary R.

