
Re: post doesn't show up



On 25 Dec 2008, Hugo Vanwoerkom wrote:

[snip] 

>> The OCR is tesseract-ocr. These steps:
>>
>> 1. apt-get install tesseract-ocr
>> 2. apt-get install tesseract-eng
>> 3. use xsane to scan a page at 300 dpi and save as .tif
>> 4. but that will be depth 16 which tesseract can't handle so reduce the 
>> depth: convert foo.tif -depth 8 foo.x1.tif
>> 5. run tesseract: tesseract foo.x1.tif foo -l eng
>> 6. text will show up as foo.txt.
>>
>> Works faultlessly with me: I have problems with single quotes and 
>> dashes but he recognizes all words perfectly.
>>
[snip] 

I agree that tesseract does work remarkably well. However, I omit the
'convert' step because for me this gives an error:

"convert: Caution: quantization tables are too coarse for baseline JPEG.`JPEGLib'."

In any case, the step seems to be unnecessary here. For me, xsane gives a
24-bit image (not 16-bit) and tesseract seems to be happy with this. I also
omit "-l eng" since I didn't include any other languages when I
installed tesseract. As suggested in the documentation, I put 

export TESSDATA_PREFIX="/usr/share/tesseract-ocr/"

in .bashrc (note the final /). 

To make things work now I just do "tesseract foo.tif foo".
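
For anyone who wants a single command that covers both cases (Hugo's
16-bit scans and my 24-bit ones), here is a rough sketch of a shell
helper. The name ocr_page is made up for illustration, and the sketch
assumes that ImageMagick's identify reports the depth per channel via
'%z', so an ordinary 24-bit RGB scan should show up as 8 and be passed
to tesseract untouched. I have not tried it on multi-page TIFFs.

```shell
#!/bin/sh
# Sketch only: ocr_page is a hypothetical helper, not part of tesseract.
# Usage: ocr_page foo.tif   (the text ends up in foo.txt)
ocr_page() {
    src="$1"
    base="${src%.tif}"          # foo.tif -> foo

    # Ask ImageMagick for the bit depth (assumed to be per channel).
    depth=$(identify -format '%z' "$src") || return 1

    # Only reduce the depth when the scan is deeper than 8 bits.
    if [ "$depth" -gt 8 ]; then
        convert "$src" -depth 8 "$base.x1.tif" || return 1
        src="$base.x1.tif"
    fi

    # Add "-l eng" here if more than one language is installed.
    tesseract "$src" "$base"
}
```

With that in .bashrc, "ocr_page foo.tif" should behave like the manual
steps: the convert step runs only when it is needed.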

I'm impressed. I mentioned ocrad a few posts ago here; that works too,
but there are more errors than with tesseract.

Anthony

-- 
Anthony Campbell - ac@acampbell.org.uk 
Microsoft-free zone - Using Debian GNU/Linux
http://www.acampbell.org.uk (blog, book reviews, 
and sceptical articles)

