Re: tesseract: ocr that works

To: debian-user@lists.debian.org
Subject: Re: tesseract: ocr that works
From: Anthony Campbell <ac@acampbell.org.uk>
Date: Sun, 28 Dec 2008 09:59:07 +0000
Message-id: <20081228095907.GF4682@acampbell.org.uk>
Mail-followup-to: debian-user@lists.debian.org
In-reply-to: <gilnaj$ga$1@ger.gmane.org>
References: <gilnaj$ga$1@ger.gmane.org>

On 21 Dec 2008, Hugo Vanwoerkom wrote:
> Hi,
>
> Recently there was a post mentioning tesseract.
>
> Turns out that it is an award-winning open source OCR engine that works!
>
> I tried it out:
>
> 1. apt-get install tesseract-ocr
> 2. apt-get install tesseract-ocr-eng
> 3. use xsane to scan a page at dpi 300 and save as .tif
> 4. run: convert foo.tif -depth 8 foo1.tif
> 5. doit: tesseract foo1.tif foo2 -l eng
>
> And voilà! There is foo2.txt with the text.
>
> This is a page that I scanned:
> http://www.scribd.com/doc/9267859/p13x1
>
> This is the result:
> http://www.scribd.com/doc/9269769/p13
>
> The only errors were some punctuation marks.
>
> [2] tesseract comes by default with the German dic.
> [3] don't scan at less than 300 dpi
> [4] the result from xsane is depth 16, which tesseract can't handle, so
> you have to convert the result to depth 8.
>
> Hugo
>

As we seem to be reposting this, here are my comments again.

Yes, tesseract does work well. Here, xsane gives depth 24, but conversion
to depth 8 is neither possible nor necessary. Following the docs, I did
 
 export TESSDATA_PREFIX="/usr/share/tesseract-ocr/"

There was no need for "-l eng" since I only had the English version of
tesseract installed. So to OCR a page scanned at 300 dpi I just do:

	tesseract foo.tif foo

The result is excellent. I got pretty good results with ocrad but
tesseract is definitely better.
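
For anyone who wants to script this, the two experiences above (depth 16
needing conversion, depth 24 working as-is) could be combined roughly as
follows. This is only a sketch under stated assumptions: ocr_page and
needs_depth_fix are made-up helper names, ImageMagick's identify and
convert are assumed to be installed, and the rule "reduce anything deeper
than 8 bits per channel" is an assumption drawn from the depth-16 report
quoted above, not something from the tesseract docs.

```shell
#!/bin/sh
# Sketch only: needs_depth_fix and ocr_page are hypothetical helper names.

# Succeeds (exit status 0) when the per-channel bit depth is above 8.
# Assumption: such images are the ones that need "convert -depth 8".
needs_depth_fix() {
    [ "$1" -gt 8 ]
}

# ocr_page IMAGE [LANG]
# Writes the recognised text to IMAGE with its extension replaced by .txt.
ocr_page() {
    img=$1
    lang=${2:-eng}          # default to English, as in the thread
    base=${img%.*}          # foo.tif -> foo

    # %z is ImageMagick's per-channel depth; [0] limits it to the first frame.
    depth=$(identify -format '%z' "${img}[0]") || return 1

    if needs_depth_fix "$depth"; then
        # Same conversion as step 4 of the quoted recipe.
        convert "$img" -depth 8 "${base}-8bit.tif" || return 1
        img="${base}-8bit.tif"
    fi

    # tesseract appends .txt to the output base name itself.
    tesseract "$img" "$base" -l "$lang"
}
```

After sourcing the file, "ocr_page foo.tif" should leave the text in
foo.txt; a second argument such as "deu" selects another language, provided
the matching language package is installed.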

Anthony



-- 
Anthony Campbell - ac@acampbell.org.uk 
Microsoft-free zone - Using Debian GNU/Linux
http://www.acampbell.org.uk (blog, book reviews, 
and sceptical articles)
