
tesseract: ocr that works



Hi,

Recently there was a post mentioning tesseract.

It turns out to be an award-winning open-source OCR engine that works!

I tried it out:

1. apt-get install tesseract-ocr
2. apt-get install tesseract-ocr-eng
3. use xsane to scan a page at 300 dpi and save it as .tif
4. run: convert foo.tif -depth 8 foo1.tif
5. run: tesseract foo1.tif foo2 -l eng

And voilà! There is foo2.txt with the text.
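
Steps 4 and 5 can be wrapped in a small shell function. This is only a sketch: the names ocr_page, run and DRY_RUN are made up for illustration, and it assumes the scan file name ends in .tif and that convert (ImageMagick) and tesseract are on the PATH.

```shell
#!/bin/sh
# Sketch of steps 4 and 5: reduce the scan to 8-bit depth, then OCR it.
# ocr_page, run and DRY_RUN are hypothetical names, not part of tesseract.

run() {
    # With DRY_RUN=1 only print the command, otherwise execute it.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

ocr_page() {
    scan=$1                 # the file saved by xsane, e.g. foo.tif
    base=${scan%.tif}       # strip the extension: foo
    # Step 4: write an 8-bit copy next to the original.
    run convert "$scan" -depth 8 "${base}1.tif" &&
    # Step 5: OCR the copy; tesseract appends .txt to the output base.
    run tesseract "${base}1.tif" "${base}2" -l eng
}
```

After sourcing this, "ocr_page foo.tif" should leave the text in foo2.txt, and "DRY_RUN=1 ocr_page foo.tif" just prints the two commands without running them.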

This is a page that I scanned:
http://www.scribd.com/doc/9267859/p13x1

This is the result:
http://www.scribd.com/doc/9269769/p13

The only errors were some punctuation marks.

[2] tesseract comes by default with the German dictionary, so the English one has to be installed separately.
[3] don't scan at less than 300 dpi
[4] the result from xsane has depth 16, which tesseract can't handle, so you have to convert the result to depth 8.
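
To check whether a scan needs the conversion from note [4], ImageMagick's identify can print the bit depth with the %z format. A minimal sketch, assuming a single-page TIFF; needs_8bit is a made-up helper name:

```shell
# Hypothetical helper: succeeds (exit 0) when the image is not 8-bit
# and so still needs "convert ... -depth 8" before running tesseract.
needs_8bit() {
    # %z is ImageMagick's format escape for the image depth.
    depth=$(identify -format '%z' "$1") || return 2
    [ "$depth" -ne 8 ]
}
```

For example: if needs_8bit foo.tif; then convert foo.tif -depth 8 foo1.tif; fi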

Hugo

