Re: pdf to text
Karsten M. Self wrote:
if you've installed xpdf-utils, use pdftptext.
the correct name is pdftotext. sorry.
Right. Works only for a subset of PDF docs, as well, with results all
over the map, from excellent (for some US Supreme Court decisions I
rendered to text) to not at all (for scanned-in FAX TIFFs).
Perhaps I am missing your point, but how is the inability to extract
text from a document which contains none reasonably a count against
pdftotext?
pdftotext isn't an ocr application and does not claim to be (see the man
page). It can't extract text from a document which does not contain any
text, and a pdf document which is nothing but a series of TIFF images
does not contain any text.
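A quick way to tell which kind of pdf you have is to run pdftotext and see whether anything other than whitespace comes out. The following is only a rough sketch: the function name has_text_layer is made up for illustration, and it assumes pdftotext is on your PATH and that a "-" output argument sends the text to stdout.

```shell
# has_text_layer FILE
# Succeeds (exit 0) if pdftotext emits at least one non-whitespace
# character for FILE, fails otherwise. Hypothetical helper name.
has_text_layer() {
    # "-" as the output file asks pdftotext to write to stdout;
    # errors are discarded so a broken file simply counts as "no text".
    pdftotext "$1" - 2>/dev/null | grep -q '[^[:space:]]'
}
```

Something like `has_text_layer fax.pdf || echo "no text layer, needs OCR"` would then flag the scanned-image case before you blame the converter. A scanned pdf typically yields nothing but form feeds, which is why the check ignores whitespace.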
If processing pdfs of scan-images of faxes is something you regularly
need to do, you might want to try first using pdfimages to extract the
embedded images and then something like gocr to extract text from the