Re: pdf to text

on Fri, Apr 30, 2004 at 02:00:53PM -0500, dircha (dircha@dircha.com) wrote:
> Karsten M. Self wrote:
> >>>if you've installed xpdf-utils, use pdftptext.
> >>the correct name is pdftotext. sorry.
> >
> >Right.  Works only for a subset of PDF docs, as well, with results all
> >over the map, from excellent (for some US Supreme Court decisions I
> >rendered to text) to not at all (for scanned-in FAX TIFFs).
> Perhaps I am missing your point, but how is the inability to extract 
> text from a document which contains none reasonably a count against 
> pdftotext?

It isn't a count against pdftotext at all.

It's meant as a caution to a user who blindly expects all PDFs to be
renderable as text.  PDF isn't a single standardized format, except in
the grossest sense, but rather a wrapper for a whole host of sins.
While one can _sometimes_ get good results converting PDFs back to
text, the results, as indicated, vary widely.

If anything, this is more an indictment of PDF and, worse, of the abuses
some people subject it to.  Those scanned TIFFs I've spoken of
(encountered in my SCO vs. IBM case documents review & transcription)
can run to 40-50+ MiB for a few pages of printed output that amount to
barely a few hundred KiB of actual text.

> pdftotext isn't an ocr application ...

Never said it was.

> If processing pdfs of scan-images of faxes is something you regularly 
> need to do

Not the intent of my post.  My point is merely that the ability to
extract text from a PDF varies greatly with the amount of text *in* the
PDF, and with how that text is stored within it.
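
The point above can be checked mechanically before trusting any
output: run pdftotext and compare how much text comes back against the
size of the file.  What follows is only a rough sketch in Python,
assuming pdftotext (from xpdf-utils) is on the PATH.  The function
names and the 0.1% threshold are illustrative choices of mine, not
anything defined by xpdf, and a low ratio is a hint rather than proof
that a PDF is nothing but scanned images.

```python
import os
import subprocess
import sys


def text_ratio(text: str, pdf_bytes: int) -> float:
    """Share of the PDF's size accounted for by extracted, non-whitespace text."""
    if pdf_bytes <= 0:
        return 0.0
    visible = "".join(text.split())  # drop spaces, newlines, form feeds
    return len(visible.encode("utf-8")) / pdf_bytes


def looks_image_only(text: str, pdf_bytes: int, threshold: float = 0.001) -> bool:
    """Guess that a PDF is mostly images; the threshold is an arbitrary assumption."""
    return text_ratio(text, pdf_bytes) < threshold


def extract_text(pdf_path: str) -> str:
    """Run pdftotext; "-" as the output file sends the text to stdout."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"], capture_output=True, check=True
    )
    return result.stdout.decode("utf-8", errors="replace")


if __name__ == "__main__":
    for path in sys.argv[1:]:
        size = os.path.getsize(path)
        text = extract_text(path)
        verdict = "little or no text" if looks_image_only(text, size) else "has text"
        print(f"{path}: {size} bytes, ratio {text_ratio(text, size):.4f}, {verdict}")
```

Run against a directory of mixed documents, the scanned-fax sort should
show a ratio at or near zero, while a PDF generated from real text
should come in far higher.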


Karsten M. Self <kmself@ix.netcom.com>        http://kmself.home.netcom.com/
 What Part of "Gestalt" don't you understand?
    Bush/Cheney '04: Putting the "con" in conservatism
