
Re: Convert a pdf to text



On Thu, Sep 22, 2011 at 11:01 AM, Sharon Kimble <skimble04@gmail.com> wrote:
> I have a 96 page pdf file that I need to convert to text in one run.
> I've imported it into inkscape but that only converts one page at a
> time. I've tried using pdftotext but I can't work out the syntax for
> that so am unable to test it out properly. I've tried pdfedit but that
> only works on one page at a time and doesn't convert it to text.
>
> Can anyone help me out with suggestions for converting the pdf in one
> go to text please?
>
> Many thanks
> Sharon.
> --

Use pdftotext if you want it converted to plain text, like this:
pdftotext -layout /path/to/pdffile.pdf /path/to/textfile.txt
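
If you don't want to type the output path each time, the command above can be wrapped in a small shell helper. This is only a rough sketch: the function names txt_name and pdf_to_txt are made up for illustration, and it assumes the input name ends in .pdf and that pdftotext is on your PATH.

```shell
#!/bin/sh
# Hypothetical helper: build the output name by swapping a trailing
# .pdf for .txt (a name without .pdf just gets .txt appended).
txt_name() {
    printf '%s\n' "${1%.pdf}.txt"
}

# Hypothetical wrapper: convert every page of one PDF to plain text,
# keeping the physical layout, and write it next to the original.
pdf_to_txt() {
    pdftotext -layout "$1" "$(txt_name "$1")"
}

# Example use (commented out so that sourcing this file runs nothing):
# pdf_to_txt /path/to/pdffile.pdf
# for f in ./*.pdf; do pdf_to_txt "$f"; done
```

pdftotext converts all pages in one run by default, so no page options are needed for a 96 page file.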

or, if you want it as text-only HTML:
pdftotext -htmlmeta /path/to/pdffile.pdf /path/to/textonlyHTMLfile.html

If you want to keep images, colors and other formatting as well, then
HTML is the only output format that will do it. Use pdftohtml for that.
Note that pdftohtml is memory intensive.

To convert to a single HTML file for the content:
pdftohtml -p -nodrm /path/to/pdffile.pdf /path/to/htmlfile.html

This actually creates three HTML files:
htmlfile.html - the main file to view
htmlfiles.html - the full converted single html file
htmlfile_ind.html - Navigation page.

To convert to multiple HTML files (one HTML file for each page):
pdftohtml -c -p -nodrm /path/to/pdffile.pdf /path/to/htmlfile.html

This creates two main files, along with one HTML page for each page in the book:
htmlfile.html - the main file to view
htmlfile_ind.html - the navigation page

Keep in mind that pdftohtml is memory intensive, and creating a
single-page HTML file is extremely memory intensive.
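
If memory turns out to be a problem, one thing that might help is converting the PDF a few pages at a time with pdftohtml's -f (first page) and -l (last page) options. The sketch below is untested against large files and is only an assumption about what may reduce memory use; page_ranges and convert_in_chunks are hypothetical names, and the chunk size of 32 is arbitrary.

```shell
#!/bin/sh
# Hypothetical helper: print "first last" page pairs that cover
# pages 1..total in chunks of at most `step` pages.
page_ranges() {
    total=$1 step=$2 first=1
    while [ "$first" -le "$total" ]; do
        last=$((first + step - 1))
        # Clamp the final chunk to the last page of the document.
        [ "$last" -gt "$total" ] && last=$total
        printf '%s %s\n' "$first" "$last"
        first=$((last + 1))
    done
}

# Hypothetical wrapper: run pdftohtml once per chunk, writing each
# chunk to its own output name (e.g. htmlfile_1-32).
# Usage: convert_in_chunks /path/to/pdffile.pdf /path/to/htmlfile 96 32
convert_in_chunks() {
    page_ranges "$3" "$4" | while read -r first last; do
        pdftohtml -c -p -nodrm -f "$first" -l "$last" "$1" "${2}_${first}-${last}"
    done
}
```

For a 96 page file with chunks of 32, this would run pdftohtml three times, for pages 1-32, 33-64 and 65-96.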
-- 
The mysteries of the Universe are revealed when you break stuff.

