
Re: proofing searchable pdf files



On 11/01/2014 06:35 PM, Scott Ferguson wrote:
On 31/10/14 11:47, Gary Roach wrote:
Hi all,

Problem: I am working on an archiving project and wish to archive
documents to searchable pdf files but can't seem to figure out how to
proof read and correct the text overlay. Any suggestions?
I'm not sure what you mean by "text *overlay*"... but, my usual approach
is to only edit the text content of the final output if the font is
unique - otherwise I feed the problematic text back into the training
data.
https://code.google.com/p/tesseract-ocr/wiki/AddOns

System: Debian Wheezy, Intel i5-750 processor, HP Officejet Pro 8600
wireless all-in-one printer/fax/scanner, gscan2pdf software with
Tesseract OCR, 300 to 600 dpi scans.

Tesseract seems to do a really great job but I have no good way of
proving this or correcting any mistakes.
Are they the only tesseract components you have installed??
What are the project constraints that prevent you from using the
traditional toolsets for similar projects (what you have listed is
better suited to scanning a few pages only)??

e.g. is there a reason you are not using Terese, YAGF or Lector (or any
of the other fine interfaces that allow proof-reading)?
http://terese.sourceforge.net/
http://code.google.com/p/yagf/
https://code.google.com/p/lector/

What about the standard box file editor and trainers:
https://code.google.com/p/tesseract-ocr/wiki/AddOns

Some of the documents are 100 years old and may not be in such great
shape. I can always retype everything but would like to avoid this,
as much as possible, for obvious reasons.

Gary R.



Given that the default output is a standard utf-8 text file.... why are
people proposing convoluted processes to edit the text in a pdf?

Do you /have/ to tif -> pdf immediately??

It would make more sense from my experience of working with tesseract
and auto-bookscanners to just generate the tif files, then proof-read,
then convert to the final output format (puzzled).
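
A minimal sketch of the first step of that workflow (OCR each tif to a
plain text file that can be proof-read in any editor), assuming the
tesseract command is on the PATH. The function names, the "scans"
directory, and the "eng" language code are hypothetical, chosen only
for illustration. The final conversion to pdf is left out.

```python
import subprocess
from pathlib import Path


def tesseract_command(tif_path, lang="eng"):
    """Build the command line that OCRs one TIFF into a UTF-8 text file.

    Tesseract appends ".txt" to the output base itself, so page01.tif
    yields page01.txt next to the scan.
    """
    tif_path = Path(tif_path)
    output_base = tif_path.with_suffix("")  # strip the ".tif" extension
    return ["tesseract", str(tif_path), str(output_base), "-l", lang]


def ocr_directory(scan_dir, lang="eng"):
    """Run Tesseract on every .tif in scan_dir (hypothetical layout).

    Returns the paths of the text files that are ready to proof-read.
    """
    text_files = []
    for tif_path in sorted(Path(scan_dir).glob("*.tif")):
        # check=True raises CalledProcessError if Tesseract fails on a page
        subprocess.run(tesseract_command(tif_path, lang), check=True)
        text_files.append(tif_path.with_suffix(".txt"))
    return text_files
```

Calling ocr_directory("scans") would leave one .txt file beside each
scan. Keeping the tif and the text side by side makes it easy to
compare them while correcting.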


Kind regards


This whole process is new to me and I am struggling to get my feet on the ground. I just came to the same conclusion about trying to proof PDFs instead of using the raw TIFF files. Thank you for the list of alternatives to Tesseract. I will check them out. I am a bit unsure about the "Tesseract tool set" and need to do more research into this area. One of the hardest things about developing a new skill set for computers is finding the correct software and documentation. I'm still working on this.

Thanks

Gary R.

