I'm using ocropus and tesseract in Knoppix with a very good detection
rate. This is how it usually works (with ocropus version 0.3):
scanimage --mode lineart --resolution 300 | \
pnmflip -topbottom -leftright > scan.pnm
(I use pnmflip because my scanner scans from bottom to top, thus
producing an upside-down picture).
ocroscript recognize --tesslanguage=eng scan.pnm | \
sed 's,</span>,</span><br/>,g' | \
elinks -dump-width 79 -no-connect -force-html -no-numbering \
-no-references -dump > scan.txt
(I use sed and elinks to produce formatted plain text with correct
line breaks that reflect the original layout. Lines longer than 79
characters are also wrapped, which is convenient when reading on a
40-character braille device).
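To see what the sed step does on its own, here is a small sketch. The
input is a made-up fragment, not real ocropus output, and add_breaks is
just a hypothetical name for the sed call used in the pipeline:

```shell
# add_breaks is a hypothetical helper wrapping the sed step of the pipeline.
add_breaks() {
    # Append an HTML line break after every closing </span>, so that
    # elinks renders each recognized line on a line of its own.
    sed 's,</span>,</span><br/>,g'
}

# Made-up sample: two text lines delivered on a single physical line.
sample='<span>first line</span><span>second line</span>'
printf '%s\n' "$sample" | add_breaks
# prints: <span>first line</span><br/><span>second line</span><br/>
```

Without the inserted <br/> tags, elinks would be free to join both spans
into one running line.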
If you get an empty page, you should try again with the page turned
upside-down or turned 90/270 degrees to landscape. ocropus does not yet
detect page orientation on its own. There are some ways to autodetect
this by scripting, but all of them are slower than just retrying
with the picture rendered with a different orientation.
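The retrying can be scripted roughly like this. It is only a sketch:
recognize_page and try_orientations are hypothetical helper names, and
it assumes that a failed recognition leaves no letters or digits at all
in the text output, which may not hold for every page.

```shell
# recognize_page: hypothetical wrapper around the pipeline shown above.
recognize_page() {
    ocroscript recognize --tesslanguage=eng "$1" | \
    sed 's,</span>,</span><br/>,g' | \
    elinks -dump-width 79 -no-connect -force-html -no-numbering \
        -no-references -dump
}

# try_orientations SCAN OUT: run OCR on SCAN as-is, then rotated by
# 90, 180 and 270 degrees, and stop at the first attempt that yields text.
try_orientations() {
    src=$1
    out=$2
    tmp="$src.rot.$$"
    for rot in none r90 r180 r270; do
        if [ "$rot" = none ]; then
            cp "$src" "$tmp"
        else
            pnmflip "-$rot" "$src" > "$tmp"
        fi
        recognize_page "$tmp" > "$out"
        # Assumption: an unreadable orientation produces no alphanumerics.
        if grep -q '[[:alnum:]]' "$out"; then
            rm -f "$tmp"
            echo "$rot"    # report which rotation worked
            return 0
        fi
    done
    rm -f "$tmp"
    return 1
}
```

Called as "try_orientations scan.pnm scan.txt", it prints the rotation
that produced text and returns non-zero if none did.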
For most printed books, I get an error/misdetection rate below 2%, even
for multicolumn texts and two-page scanning, which is IMHO pretty good.
On Tue, Jan 19, 2010 at 07:35:27AM -0500, email@example.com wrote:
> Mario mentioned a while ago that he thought Ocropus was working well. Unfortunately, my experience is that it recognizes exactly nothing on a page, just spitting out a header and footer for the page without even an attempt at recognition. I am using the ocropus packages in testing. Can anyone provide any advice?
- From: firstname.lastname@example.org