Re: tesseract: ocr that works

To: debian-user@lists.debian.org
Subject: Re: tesseract: ocr that works
From: andmalc <andmalc@gmail.com>
Date: Sun, 28 Dec 2008 09:06:32 -0800 (PST)
Message-id: <9c8f719a-9798-4a07-984e-91f83d1eaa74@t3g2000yqa.googlegroups.com>
References: <gilnaj$ga$1@ger.gmane.org> <20081228095907.GF4682@acampbell.org.uk>

On Dec 28, 5:10 am, Anthony Campbell <a...@acampbell.org.uk> wrote:
> On 21 Dec 2008, Hugo Vanwoerkom wrote:
>
[snip]

> Yes, tesseract does work well. Here, xsane gives depth 24, but conversion
> to depth 8 is neither possible nor necessary. Following the docs, I did

There is an option at the top of the Preferences/Filetype tab to save
in 8-bit, but I'm glad to know this isn't needed.

>  export TESSDATA_PREFIX="/usr/share/tesseract-ocr/"
>
> There was no need for "- l eng" since I only had the English version of
> tesseract installed. So to scan a page saved at 300 dpi I just do:
>
>         tesseract foo.dvi foo
>
> The result is excellent. I got pretty good results with ocrad but
> tesseract is definitely better.

I got poor results on a plain text sample, and much better ones using gocr
with the same scan saved by xsane in pnm format.  I see your input
file is a DVI.  Does that format yield better results than TIFF?  If so,
how did you convert to it from the formats that xsane will save to?

It took me a while to figure out that tesseract will not read a TIFF if
its file extension is 'tiff' instead of 'tif'.  I hadn't noticed
that in the previous poster's instructions.
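
For anyone else who trips over the extension, here is a rough sketch of a
workaround, assuming a POSIX shell.  The function name to_tif_name and the
file name scan.tiff are made up for illustration; nothing here comes from
the tesseract docs.

```shell
#!/bin/sh
# Hypothetical helper: map a '.tiff' file name to the '.tif' spelling
# that tesseract seems to insist on.  Other names pass through unchanged.
to_tif_name() {
    case "$1" in
        *.tiff) printf '%s\n' "${1%.tiff}.tif" ;;
        *)      printf '%s\n' "$1" ;;
    esac
}

# Example use (file names are placeholders):
#   mv scan.tiff "$(to_tif_name scan.tiff)"
#   tesseract scan.tif scan
```

The function only computes the new name, so the rename and the tesseract
run stay as separate steps that can be checked by hand first.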
