
Re: edit pdf's

Kevin Mark wrote:
On Tue, May 11, 2004 at 01:01:16PM -0400, Matt Price wrote:
thanks for the clues folks.  pdftohtml -- which I confess I *did*
already know about, sorry, should have said so -- won't work so well
for me, I don't think;  these are scanned-in texts from the JSTOR
journal collection, and it's important I keep the pages in order...
as, er, someone mentioned earlier (don't have the thread in front of
me at the moment), a complex process involving gimp and pdftops seems
to be the best bet, but it's insanely labour-intensive for long
documents, so I may forgo the whole project. thx all though.

You mentioned something that caught my eye, as it relates to a need in
FOSS that a friend of mine has: a replacement for the PAPERPORT
product, which allows scanning in multipage docs, with the ability to
annotate pages, store OCR data with pages and search the archive, and
which has a 'desktop environment app' that can show virtual folders of
documents with document thumbnails. PAPERPORT uses PDF as its new
format. Has anyone considered making such an app? There are many law
offices that would like this, as well as people who deal with large
document repositories.

I don't seem to have the root of this thread any longer.

However, have you looked into using pdfimages to extract the images and then gocr to extract the text from the images? You might want netpbm too if you go that route.
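
In case it helps, here is a rough sketch of that route in Python. It is untested against real JSTOR scans and rests on a few assumptions: pdfimages and gocr are installed and on the PATH, pdfimages writes files named root-NNN.pbm or root-NNN.ppm, and gocr accepts -i for the input image and -o for the output text file. The function names and the "page" file root are made up for illustration. Sorting on the numeric suffix is what keeps the pages in order.

```python
import re
import subprocess
from pathlib import Path


def extract_cmd(pdf, root):
    # pdfimages PDF-file image-root  ->  image-root-000.pbm, image-root-001.ppm, ...
    return ["pdfimages", str(pdf), str(root)]


def ocr_cmd(image, text_out):
    # gocr reads a PNM image (-i) and writes the recognised text (-o).
    return ["gocr", "-i", str(image), "-o", str(text_out)]


def page_index(path):
    # Pull the numeric suffix out of names like "page-012.pbm".
    match = re.search(r"-(\d+)\.(?:pbm|pgm|ppm)$", Path(path).name)
    return int(match.group(1)) if match else -1


def ordered_pages(paths):
    # Sort numerically, so page-1000 comes after page-101, not before it.
    return sorted(paths, key=page_index)


def pdf_to_text(pdf, workdir):
    # Extract every image, OCR each one in page order, and join the results.
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    subprocess.run(extract_cmd(pdf, workdir / "page"), check=True)
    images = [p for p in workdir.glob("page-*")
              if p.suffix in (".pbm", ".pgm", ".ppm")]
    texts = []
    for image in ordered_pages(images):
        out = image.with_suffix(".txt")
        subprocess.run(ocr_cmd(image, out), check=True)
        texts.append(out.read_text(errors="replace"))
    # A form feed between pages keeps the page boundaries visible.
    return "\n\f\n".join(texts)
```

Calling pdf_to_text("article.pdf", "work") would leave the extracted images and one .txt file per page in work/ and return the combined text. If the scans need cleaning up (cropping, thresholding, format conversion) before gocr sees them, that would slot in between the two subprocess calls, which is where netpbm comes in.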

