Re: tesseract: ocr that works
On 21 Dec 2008, Hugo Vanwoerkom wrote:
> Hi,
>
> Recently there was a post mentioning tesseract.
>
> Turns out that is an award-winning open-source OCR that works!
>
> I tried it out:
>
> 1. apt-get install tesseract-ocr
> 2. apt-get install tesseract-ocr-eng
> 3. use xsane to scan a page at dpi 300 and save as .tif
> 4. run: convert foo.tif -depth 8 foo1.tif
> 5. do it: tesseract foo1.tif foo2 -l eng
>
> And voilà! There is foo2.txt with the text.
>
> This is a page that I scanned:
> http://www.scribd.com/doc/9267859/p13x1
>
> This is the result:
> http://www.scribd.com/doc/9269769/p13
>
> The only errors were some punctuation marks.
>
> [2] tesseract comes by default with the German dictionary.
> [3] don't scan at less than 300 dpi
> [4] the result from xsane is depth 16, which tesseract can't handle, so
> you have to convert the result to depth 8.
>
> Hugo
>
As we seem to be reposting this, here are my comments again.
Yes, tesseract does work well. Here, xsane gives depth 24, but conversion
to depth 8 is neither possible nor necessary. Following the docs, I did
export TESSDATA_PREFIX="/usr/share/tesseract-ocr/"
There was no need for "-l eng" since I only had the English version of
tesseract installed. So to OCR a page scanned at 300 dpi I just do:
tesseract foo.tif foo
The result is excellent. I got pretty good results with ocrad but
tesseract is definitely better.
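For anyone who wants to script the steps discussed above, here is a minimal
sketch, not a tested recipe. It assumes ImageMagick's convert and tesseract
are on the PATH. The function names ocr_base and ocr_page and the OCR_DEPTH8
switch are made up for illustration; the depth conversion is optional because
it was needed on Hugo's setup but not on mine.

```shell
#!/bin/sh
# Sketch of the scan-to-text steps from this thread.
# ocr_base, ocr_page and OCR_DEPTH8 are hypothetical names, not part of
# tesseract or ImageMagick.

# Print the input name without directory or extension.
# tesseract appends .txt to the output base itself.
ocr_base() {
    base=${1##*/}
    printf '%s\n' "${base%.*}"
}

# OCR one image. $1 = input file, $2 = language (default: eng).
ocr_page() {
    in=$1
    lang=${2:-eng}
    out=$(ocr_base "$in")
    # Optional: reduce to depth 8 first, as in step 4 of the quoted post.
    if [ "${OCR_DEPTH8:-0}" = 1 ]; then
        convert "$in" -depth 8 "${out}-8bit.tif" || return 1
        in="${out}-8bit.tif"
    fi
    # Writes ${out}.txt in the current directory.
    tesseract "$in" "$out" -l "$lang"
}
```

After sourcing the file, "ocr_page foo.tif" should leave the text in foo.txt,
and "OCR_DEPTH8=1 ocr_page foo.tif" runs the depth conversion first.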
Anthony
--
Anthony Campbell - ac@acampbell.org.uk
Microsoft-free zone - Using Debian GNU/Linux
http://www.acampbell.org.uk (blog, book reviews,
and sceptical articles)