
Re: tesseract: ocr that works



On 28 Dec 2008, andmalc wrote:
> On Dec 28, 5:10 am, Anthony Campbell <a...@acampbell.org.uk> wrote:
> > On 21 Dec 2008, Hugo Vanwoerkom wrote:
> >
> [snip]
> 
> > Yes, tesseract does work well. Here, xsane gives depth 24, but conversion
> > to depth 8 is neither possible nor necessary. Following the docs, I did
> 
> There is an option at the top of the Preferences/Filetype tab to save
> in 8-bit, but glad to know this isn't needed.
> 
> >  export TESSDATA_PREFIX="/usr/share/tesseract-ocr/"
> >
> > There was no need for "-l eng" since I only had the English version of
> > tesseract installed. So to scan a page saved at 300 dpi I just do:
> >
> >         tesseract foo.dvi foo
> >
> > The result is excellent. I got pretty good results with ocrad but
> > tesseract is definitely better.
> 
> I got poor results on a plain text sample, and much better using gocr
> with the same scan saved by xsane in pnm format.  I see your input
> file is a DVI.  Does that format yield better results than TIFF?  If so,
> how did you convert to that from the formats that xsane will save to?
> 
> Took me a while to figure out that tesseract will not read a TIFF if
> its file extension is 'tiff' instead of 'tif'.   Hadn't quite noticed
> that in the previous poster's instructions.
> 
> 

Sorry, that was a stupid slip; I meant tiff, not dvi. And yes, you are
right, the file extension has to be tif. I get very poor results with
gocr, unusable in fact. Ocrad is better, though not as good as tesseract.
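
For anyone following along, here is a minimal sketch of the steps above,
assuming the scan was saved by xsane as a TIFF and that tesseract rejects
the 'tiff' extension as reported in this thread. The helper names
tif_name and ocr_scan are made up for illustration, and the tessdata
path is the one quoted above, which may differ on other installs.

```shell
#!/bin/sh
# Hypothetical helper: map foo.tiff to foo.tif; other names pass through.
tif_name() {
    case "$1" in
        *.tiff) printf '%s\n' "${1%.tiff}.tif" ;;
        *)      printf '%s\n' "$1" ;;
    esac
}

# Assumed tessdata location, as quoted earlier in the thread.
export TESSDATA_PREFIX="/usr/share/tesseract-ocr/"

# Hypothetical wrapper: copy the scan to a .tif name if needed,
# then run tesseract with the same base name for the output.
ocr_scan() {
    src=$1
    tif=$(tif_name "$src")
    [ "$src" = "$tif" ] || cp -- "$src" "$tif"
    tesseract "$tif" "${tif%.tif}"
}
```

With these defined, "ocr_scan foo.tiff" copies the scan to foo.tif and
runs "tesseract foo.tif foo", which writes the recognised text to foo.txt.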


-- 
Anthony Campbell - ac@acampbell.org.uk 
Microsoft-free zone - Using Debian GNU/Linux
http://www.acampbell.org.uk (blog, book reviews, 
and sceptical articles)

