tesseract: ocr that works

To: debian-user@lists.debian.org
Subject: tesseract: ocr that works
From: Hugo Vanwoerkom <hvw59601@care2.com>
Date: Sun, 21 Dec 2008 09:28:17 -0600
Message-id: <gilnaj$ga$1@ger.gmane.org>

Hi,

Recently there was a post mentioning tesseract.

It turns out to be an award-winning open-source OCR engine that works!

I tried it out:

1. apt-get install tesseract-ocr
2. apt-get install tesseract-ocr-eng
3. use xsane to scan a page at 300 dpi and save it as .tif
4. run: convert foo.tif -depth 8 foo1.tif
5. doit: tesseract foo1.tif foo2 -l eng
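
Steps 4 and 5 could be wrapped in a small shell helper, roughly as sketched below. The names ocr_page, run and DRY_RUN are made up for illustration and are not part of tesseract or ImageMagick. The sketch also names its files differently from the steps above: it writes foo-8bit.tif and foo.txt rather than foo1.tif and foo2.txt.

```shell
#!/bin/sh
# Sketch only: ocr_page, run and DRY_RUN are hypothetical names.

# run CMD...: execute a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# ocr_page SCAN.tif [LANG]
# Converts the scan to 8-bit depth (step 4), then runs tesseract on the
# converted file (step 5). The text ends up in SCAN.txt.
ocr_page() {
    scan=$1
    lang=${2:-eng}
    base=${scan%.tif}
    run convert "$scan" -depth 8 "${base}-8bit.tif" || return 1
    run tesseract "${base}-8bit.tif" "$base" -l "$lang"
}
```

After sourcing this, "ocr_page foo.tif" should leave the text in foo.txt, and "DRY_RUN=1 ocr_page foo.tif" only prints the two commands without running them.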

And voilà! There is foo2.txt with the text.

This is a page that I scanned:
http://www.scribd.com/doc/9267859/p13x1

This is the result:
http://www.scribd.com/doc/9269769/p13

The only errors were some punctuation marks.

Notes (the numbers refer to the steps above):

[2] tesseract comes with the German dictionary by default.
[3] don't scan at less than 300 dpi.
[4] the result from xsane is depth 16, which tesseract can't handle, so you have to convert the result to depth 8.
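
To see whether a scan needs the conversion from [4] at all, the bit depth can be read with ImageMagick's identify, whose %z format escape prints the image depth. A minimal sketch, where the function name needs_8bit is made up for illustration and only the first page of the file is checked:

```shell
#!/bin/sh
# Sketch only: needs_8bit is a hypothetical helper name.

# needs_8bit FILE: succeeds when the first image in FILE reports a bit
# depth other than 8, i.e. when "convert FILE -depth 8 ..." is needed.
# Returns 2 if identify itself fails (missing file, unreadable image).
needs_8bit() {
    depth=$(identify -format '%z' "$1[0]") || return 2
    [ "$depth" != "8" ]
}
```

For example: "if needs_8bit foo.tif; then convert foo.tif -depth 8 foo1.tif; fi".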


Hugo
