
Re: pdf to text

Karsten M. Self wrote:
if you've installed xpdf-utils, use pdftptext.
the correct name is pdftotext. sorry.

Right.  Works only for a subset of PDF docs, as well, with results all
over the map, from excellent (for some US Supreme Court decisions I
rendered to text) to not at all (for scanned-in FAX TIFFs).

Perhaps I am missing your point, but how is the inability to extract text from a document which contains none reasonably a count against pdftotext?

pdftotext isn't an OCR application and does not claim to be (see the man page). It can't extract text from a document that does not contain any, and a PDF that is nothing but a series of TIFF images contains no text.

If processing PDFs of scanned faxes is something you regularly need to do, you might want to try first using pdfimages to extract the embedded images and then something like gocr to extract text from the extracted images.
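
For what it's worth, here is a rough sketch of that two-step approach as a small Python wrapper. It assumes pdftotext, pdfimages and gocr are all on your PATH; the function name pdf_to_text and the "page" image prefix are made up for illustration, and the quality of what gocr returns for fax scans is not something this sketch can promise.

```python
import subprocess
import tempfile
from pathlib import Path


def pdf_to_text(pdf_path, run=subprocess.run):
    """Try pdftotext first; fall back to pdfimages + gocr if no text comes out.

    `run` is injectable only so the flow can be exercised without the tools
    installed; by default it is subprocess.run.
    """
    # "-" as the output file asks pdftotext to write to stdout.
    result = run(["pdftotext", str(pdf_path), "-"],
                 capture_output=True, text=True, check=True)
    if result.stdout.strip():
        return result.stdout

    # No text layer found: extract the embedded images and OCR each one.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp) / "page"  # hypothetical prefix for the image files
        run(["pdfimages", str(pdf_path), str(root)], check=True)
        pages = []
        # pdfimages names its output <root>-NNN.<ext>, so sorting keeps order.
        for image in sorted(Path(tmp).glob("page-*")):
            ocr = run(["gocr", str(image)],
                      capture_output=True, text=True, check=True)
            pages.append(ocr.stdout)
    return "\n".join(pages)
```

Calling pdf_to_text("fax.pdf") (the file name is just an example) returns the pdftotext output when the PDF has a text layer, and otherwise whatever gocr makes of each embedded image, joined with newlines.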

