Re: pdf to text

To: Debian-User <debian-user@lists.debian.org>
Subject: Re: pdf to text
From: dircha <dircha@dircha.com>
Date: Fri, 30 Apr 2004 14:00:53 -0500
Message-id: <4092A265.7020700@dircha.com>
In-reply-to: <20040430061538.GH2152@ix.netcom.com>
References: <200404221221.18521.leva@az.isten.hu> <21675.195.55.91.129.1082630154.squirrel@llca512-a.servidoresdns.net> <7367.195.55.91.129.1082645419.squirrel@llca512-a.servidoresdns.net> <20040430061538.GH2152@ix.netcom.com>

Karsten M. Self wrote:

>> if you've installed xpdf-utils, use pdftptext.

>> the correct name is pdftotext. sorry.


> Right.  Works only for a subset of PDF docs, as well, with results all
> over the map, from excellent (for some US Supreme Court decisions I
> rendered to text) to not at all (for scanned-in FAX TIFFs).

Perhaps I am missing your point, but how is the inability to extract text from a document which contains none reasonably a count against pdftotext?

pdftotext isn't an OCR application and does not claim to be (see the manpage). It can't extract text from a document which does not contain any text, and a PDF document which is nothing but a series of TIFF images does not contain any text.
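
A rough way to tell the two kinds of document apart is to run pdftotext and see whether anything other than whitespace comes back. What follows is only a sketch under assumptions: the helper names pdftotext_command, has_extractable_text and pdf_contains_text are made up for illustration, and it assumes pdftotext is on the PATH and that giving "-" as the output file sends the text to stdout.

```python
import subprocess


def pdftotext_command(pdf_path):
    """Build the pdftotext command line; "-" asks for the text on stdout."""
    return ["pdftotext", str(pdf_path), "-"]


def has_extractable_text(text):
    """True if the output holds anything besides whitespace.

    An image-only PDF typically yields nothing but page breaks (form
    feeds) and newlines, all of which str.strip() removes.
    """
    return bool(text.strip())


def pdf_contains_text(pdf_path):
    """Run pdftotext (assumed to be installed) and inspect its output."""
    result = subprocess.run(
        pdftotext_command(pdf_path),
        capture_output=True,
        text=True,
        check=True,
    )
    return has_extractable_text(result.stdout)
```

If pdf_contains_text("some.pdf") comes back False, the document is most likely a pile of scanned images and no text extractor will get anything out of it directly.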

If processing PDFs of scan-images of faxes is something you regularly need to do, you might want to try first using pdfimages to extract the embedded images and then something like gocr to extract text from the extracted images.
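
The two steps could be strung together roughly like this. Treat it as a sketch, not a tested recipe: the function names, the "page" image root and the working directory are all made up for illustration, and it assumes both pdfimages (from xpdf-utils) and gocr are installed, that pdfimages writes its default PBM/PGM/PPM files as <root>-NNN.<ext>, and that gocr takes -i for the input image and -o for the output file.

```python
import subprocess
from pathlib import Path

# Image formats pdfimages writes by default and gocr can read.
PNM_SUFFIXES = {".pbm", ".pgm", ".ppm"}


def pdfimages_command(pdf_path, image_root):
    """Command that dumps the embedded images to <image_root>-NNN.<ext>."""
    return ["pdfimages", str(pdf_path), str(image_root)]


def gocr_command(image_path, text_path):
    """Command that runs OCR on one image and writes the result to a file."""
    return ["gocr", "-i", str(image_path), "-o", str(text_path)]


def ocr_scanned_pdf(pdf_path, work_dir):
    """Extract the images from a scanned PDF, then OCR each one.

    Returns the paths of the text files gocr was asked to write.
    """
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)

    # Step 1: pull the embedded page images out of the PDF.
    subprocess.run(pdfimages_command(pdf_path, work_dir / "page"), check=True)

    # Step 2: OCR every extracted image, in page order.
    text_files = []
    for image in sorted(work_dir.glob("page-*")):
        if image.suffix not in PNM_SUFFIXES:
            continue
        text_path = image.with_suffix(".txt")
        subprocess.run(gocr_command(image, text_path), check=True)
        text_files.append(text_path)
    return text_files
```

Something like ocr_scanned_pdf("fax.pdf", "fax-ocr") would then leave one .txt file per page image in fax-ocr/. How good the result is depends entirely on the scan quality, so expect to proofread.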


dircha

References:
- pdf to text
  - From: LeVA <leva@az.isten.hu>
- Re: pdf to text
  - From: Diego Martínez Castañeda <dmartinez@keekorok.com>
- Re: pdf to text
  - From: Diego Martínez Castañeda <dmartinez@keekorok.com>
- Re: pdf to text
  - From: "Karsten M. Self" <kmself@ix.netcom.com>
