
Re: document archiving w/ scanner



On Sat, Jul 10, 2004 at 12:15:28AM +0200, martin f krafft wrote:
> also sprach Andrew Perrin <clists@perrin.socsci.unc.edu> [2004.07.09.2221 +0200]:
> > Correct - if you want searchable text you need some OCR filter.
> > I've used gocr with some, moderate, success, but it's by no means
> > perfect. Others have recommended clara, which is probably better
> > but requires too much user involvement for my taste!
> 
> Yes, I am starting to notice that we need to get into the OCR
> domain. I am new to scanning, so please excuse me not making that
> jump before posting.
> 
> So far it sounds like HP has open source drivers for their
> all-in-ones... if I can find one with automated pagefeeding, I am
> off to try clara...

Search the archives for my and others' discussions about Project
Gutenberg's tests with gocr and other open source OCR programs.  They
all do perfectly on perfect texts, but are basically unusable on
"typical" texts.  If the text is not perfectly straight and set in a
great big font, i.e., printed with OCR in mind, gocr does an abysmal
job -- whereas closed source OCR software reached 95% accuracy on these
"typical" texts in, oh, I don't know, 1996.

The OCR software that comes with Microsoft Office beats the crap out of 
GOCR, even with cleanly printed books with nice fonts that you'd expect 
to be easy to scan.

What's missing in GOCR is a "slanted text straightener" (deskewing)
algorithm.
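
To make the "straightener" idea concrete, here is a rough sketch of one
common way to estimate the slant of a scanned page, the projection
profile method.  This is not GOCR code and not what any particular
closed source package does; the function names and the (x, y) pixel list
representation are made up for illustration.  The idea: for each
candidate angle, slide every black pixel back along that slope and count
pixels per row.  When the angle matches the slant, the text lines
collapse into a few dense rows, so the sum of squared row counts peaks.

```python
import math


def row_profile_score(pixels, angle_deg):
    """Score how well angle_deg straightens the page.

    pixels is an iterable of (x, y) black-pixel coordinates, y growing
    downward.  Each pixel is sheared back along the candidate slope and
    the per-row pixel counts are squared and summed; sharp, dense rows
    give a high score.
    """
    slope = math.tan(math.radians(angle_deg))
    counts = {}
    for x, y in pixels:
        row = round(y - x * slope)
        counts[row] = counts.get(row, 0) + 1
    return sum(c * c for c in counts.values())


def estimate_skew(pixels, max_angle=5.0, step=0.5):
    """Return the candidate angle (degrees) with the best row profile.

    Searches -max_angle..+max_angle in increments of step.  A positive
    result means the text lines drift downward from left to right.
    """
    pixels = list(pixels)
    best_angle, best_score = 0.0, -1
    n_steps = int(round(2 * max_angle / step))
    for i in range(n_steps + 1):
        angle = -max_angle + i * step
        score = row_profile_score(pixels, angle)
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle
```

Once the angle is known, the page would be rotated (or sheared) by the
opposite amount before the character recognizer sees it.  A real
implementation would work on the bitmap directly and refine the angle in
finer steps; this only shows the principle.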


