
Re: proofing searchable pdf files



On 04/11/14 12:17, Gary Roach wrote:
> On 11/01/2014 06:35 PM, Scott Ferguson wrote:
>> On 31/10/14 11:47, Gary Roach wrote:
>>> Hi all,
>>>
>>> Problem: I am working on an archiving project and wish to archive
>>> documents to searchable PDF files, but can't seem to figure out how to
>>> proofread and correct the text overlay. Any suggestions?

<snipped>
>>
>>
> This whole process is new to me and I am struggling to get my feet on
> the ground. 

I /thought/ I knew what I was up against when I first worked with
Tesseract[*1] - given my previous experience with several very large OCR
projects. Wrong! :(
Then, after my first Tesseract OCR project I /thought/ I was better
informed[*2].... (sigh).  :)

Hence my questions about constraints.

[*1] built my own auto-book scanner
[*2] worked on a project where volunteers had previously "scanned"
documents and "tried" to use Tesseract. :/

> I just came to the same conclusion about trying to proof
> PDFs instead of using the raw TIFF files. Thank you for the list of
> alternatives to Tesseract. 

They are not "alternatives" to Tesseract - just alternative "interfaces"
to the Tesseract engine.
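
To make the distinction concrete, here is a minimal Python sketch of what any
such front end ultimately has to do: hand an image to the same engine and ask
for a particular output. It assumes the stock `tesseract` command-line program
and its usual `imagename outputbase [options] [configfile]` calling form. The
helper name build_tesseract_cmd and the file names are made up for
illustration, and the `pdf` output config needs a reasonably recent Tesseract
(3.03 or later, as far as I know); `hocr` has been around longer.

```python
def build_tesseract_cmd(image_path, output_base, lang="eng", output_config="hocr"):
    """Build the argument list for one Tesseract run (hypothetical helper).

    output_config is the name of a Tesseract config file such as "hocr"
    (text plus word positions) or "pdf" (searchable PDF).
    """
    return ["tesseract", image_path, output_base, "-l", lang, output_config]


# Example: ask for hOCR output, which keeps word positions for later proofing.
cmd = build_tesseract_cmd("page_0001.tif", "page_0001")
print(" ".join(cmd))
```

To actually run it, pass the list to subprocess.run(cmd, check=True) on a
machine where tesseract is installed.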

> I will check them out. I am a bit unsure about
> the "Tesseract tool set" and need to do more research into this area.
> One of the hardest things about developing a new skill set for
> computers is finding the correct software and documentation. I'm still
> working on this.

Though I don't know the specifics of the project, may I suggest,
resources allowing, the following approach:-
;scan the pages as high-quality PNG images and keep the PNG originals[*1]
;try various processing methods before converting to TIFF (to get the
clearest separation of the two colours)
;keep track of the various image versions[*2] - you'll find the
scan/convert/OCR/edit process is iterative
;feed the TIFFs to Tesseract using the management interface of your choice
;create an index of the fonts used in the books you are processing - if
more than a couple of pages use a given font, spend some time
teaching Tesseract that font (much quicker than post-editing every
misread).

[*1] Especially useful for last edit layout checking.
[*2] I found Digikam invaluable for this purpose.
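
As one illustration of the "clearest separation of 2 colours" step above, here
is a small pure-Python sketch of Otsu's method, a common way to pick a
black/white threshold automatically. It is only one of the processing methods
worth trying, not necessarily the best for any given scan. The function names
and the flat list of 0-255 grey values are assumptions for the example; in
practice an image library or ImageMagick would do this on real pages.

```python
def otsu_threshold(pixels):
    """Return the grey level that best splits pixels into dark and light.

    pixels is a flat list of integers in the range 0-255. Values less than
    or equal to the returned level count as dark.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_level, best_variance = 0, -1.0
    weight_dark = 0  # number of pixels at or below the candidate level
    sum_dark = 0     # sum of grey values of those pixels
    for level in range(256):
        weight_dark += hist[level]
        sum_dark += level * hist[level]
        weight_light = total - weight_dark
        if weight_dark == 0:
            continue  # no dark pixels yet
        if weight_light == 0:
            break     # everything is already in the dark class
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        # Between-class variance (up to a constant factor); larger is better.
        variance = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if variance > best_variance:
            best_level, best_variance = level, variance
    return best_level


def binarise(pixels, threshold):
    """Map each pixel to pure black (0) or pure white (255)."""
    return [0 if p <= threshold else 255 for p in pixels]


# Toy "scan": three dark ink pixels and three light paper pixels.
sample = [30, 35, 40, 200, 210, 220]
level = otsu_threshold(sample)
print(level, binarise(sample, level))
```

Keeping the threshold used alongside each derived image makes it easier to
compare versions when you come back for another pass.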


> 
> Thanks
> 
> Gary R.
> 
> 

Hope that helps. I'd be very interested in the outcome, if you wouldn't
mind contacting me off-list.

Kind regards



--
"Turns out you can't back a winner in the Gish Gallop" ~ disappointed punter

