
Re: tesseract: ocr that works



On 21 Dec 2008, Hugo Vanwoerkom wrote:
> Hi,
>
> Recently there was a post mentioning tesseract.
>
> Turns out that is an award-winning open-source OCR that works!
>
> I tried it out:
>
> 1. apt-get install tesseract-ocr
> 2. apt-get install tesseract-ocr-eng
> 3. use xsane to scan a page at dpi 300 and save as .tif
> 4. run: convert foo.tif -depth 8 foo1.tif
> 5. doit: tesseract foo1.tif foo2 -l eng
>
> And voilà! There is foo2.txt with the text.
>
> This is a page that I scanned:
> http://www.scribd.com/doc/9267859/p13x1
>
> This is the result:
> http://www.scribd.com/doc/9269769/p13
>
> The only errors were some punctuation marks.
>
> [2] tesseract comes by default with the German dic.
> [3] don't scan at less than 300 dpi
> [4] the result from xsane is depth 16, which tesseract can't handle, so
> you have to convert the result to depth 8.
>
> Hugo
>

As we seem to be reposting this, here are my comments again.

Yes, tesseract does work well. Here, xsane gives depth 24, but conversion
to depth 8 is neither possible nor necessary. Following the docs, I did
 
 export TESSDATA_PREFIX="/usr/share/tesseract-ocr/"

There was no need for "-l eng" since I only had the English version of
tesseract installed. So to OCR a page scanned at 300 dpi and saved as
foo.tif I just do:

	tesseract foo.tif foo

The result is excellent. I got pretty good results with ocrad but
tesseract is definitely better.
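
For anyone who wants the steps from this thread in one place, here is a
minimal sketch as a shell function. It assumes ImageMagick's convert and
tesseract are both on the PATH. The function name ocr_page and the
OCR_DRY_RUN variable are made up for illustration. It always does the
depth-8 conversion from Hugo's step 4, which may be unnecessary on some
setups (it was here).

```shell
#!/bin/sh
# Sketch of the scan-to-text workflow discussed in this thread.
# Assumptions: `convert` (ImageMagick) and `tesseract` are installed.
# ocr_page and OCR_DRY_RUN are hypothetical names, not part of either tool.

ocr_page() {
    input=$1          # scanned image, e.g. foo.tif
    base=$2           # output base name; tesseract writes "$base.txt"
    lang=${3:-eng}    # language data to use, defaults to eng
    tmp="${base}-8bit.tif"

    # With OCR_DRY_RUN=1, print each command instead of running it.
    run() {
        if [ "${OCR_DRY_RUN:-0}" = 1 ]; then
            echo "$*"
        else
            "$@"
        fi
    }

    # Reduce the scan to depth 8 (Hugo's step 4).
    run convert "$input" -depth 8 "$tmp" || return 1
    # Run the OCR (Hugo's step 5); the text ends up in "$base.txt".
    run tesseract "$tmp" "$base" -l "$lang"
}
```

After sourcing this, "ocr_page foo.tif foo2" should leave the text in
foo2.txt, and setting OCR_DRY_RUN=1 first only prints the two commands so
you can check them before running anything.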

Anthony



-- 
Anthony Campbell - ac@acampbell.org.uk 
Microsoft-free zone - Using Debian GNU/Linux
http://www.acampbell.org.uk (blog, book reviews, 
and sceptical articles)

