Re: pdftohtml
On Sun, Dec 19, 2004 at 01:39:49PM -0800, Karsten M. Self wrote:
> on Wed, Dec 01, 2004 at 01:48:33AM +0100, Gerard Robin (jag.robin18@wanadoo.fr) wrote:
> > Hello,
> >
> > I have a few problems with pdftohtml (unstable) :
> >
> > with one pdf file I get a suitable html file but with another one I get an unreadable html file.
> >
> > I tried "pdftohtml -c -l 1 file.pdf" but the output is always unreadable and I get the message:
> >
> > free(): invalid pointer 0x80f02e0!
> > Page-1
> >
> >
> > However xpdf (or gv) displays correctly this file.pdf.
> >
> > I guess that the problem comes out of the feature of this pdf file and
> > I would like to know if it
>
> Note first that 'PDF' isn't a simple file format. Some PDFs are little
> more than marked-up text, others are essentially large image files
> (scanned in faxes from lawyers, such as are posted to Groklaw, are
> infamous for this).
>
> There are also a few different versions of the PDF and PS formats.
>
>
> If you can post or point to the file you're trying to convert, this
> could be helpful. Knowing how that file was created and with what
> tools, ditto.
>
> 'ps2ps' on a Postscript file sometimes works around bugs that stymie
> some viewers (or printers). It's a roundabout way, but:
>
> pdf2ps file.pdf file.ps
> ps2ps file.ps file-new.ps
> ps2pdf file-new.ps file-new.pdf
> pdftohtml file-new.pdf file-new.html
>
> ...might get you somewhere. Most likely, a really broken hash of a
> file.
>
>
> Alternatively, if the source of the PDF file is available, converting
> *it* to HTML directly should provide far superior results.
I have joined the pdftohtml-general list and I obtained part of the solution:
We have to copy the file /etc/xpdf/xpdfrc into our home directory (as .xpdfrc) and
add this line to it:
unicodeMap Latin2 /usr/share/xpdf/latin2/Latin2.unicodeMap
After that, we must run the command:
pdftohtml -enc Latin2 file.pdf
Normally this should produce a file named file.html, but I got a segmentation fault ;-)
I tried again with pdftohtml -c -enc Latin2 file.pdf and this time it worked.
The result was better than with the plain command pdftohtml file.pdf, but it is
not perfect yet:
The accents are almost all right, except for è and ê, and the underline
(image.png) is not in the right place.
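
The configuration steps above can be sketched as a small shell function. This is
only a minimal sketch: the function name add_latin2_map is made up for
illustration, and the paths are simply the ones quoted in this message, so they
may differ on other systems.

```shell
# Hypothetical helper: prepare a personal xpdfrc containing the Latin2 map line.
# $1 = personal config file, $2 = system-wide config (defaults to the path above).
add_latin2_map() {
    user_rc=$1
    system_rc=${2:-/etc/xpdf/xpdfrc}
    map_line='unicodeMap Latin2 /usr/share/xpdf/latin2/Latin2.unicodeMap'

    # Start from the system-wide file if there is no personal copy yet.
    if [ ! -f "$user_rc" ] && [ -f "$system_rc" ]; then
        cp "$system_rc" "$user_rc"
    fi

    # Append the unicodeMap line only if it is not already present.
    if ! grep -qxF "$map_line" "$user_rc" 2>/dev/null; then
        printf '%s\n' "$map_line" >> "$user_rc"
    fi
}
```

It would be called as add_latin2_map "$HOME/.xpdfrc", followed by
pdftohtml -c -enc Latin2 file.pdf. Running it twice does not duplicate the line.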
The person on the pdftohtml-general list who helped me was surprised that the command
pdftohtml -enc Latin2 file.pdf gave me a segmentation fault, whereas for him it
worked fine.
He wondered whether my OS (unstable) might be the problem.
Here is the link to the archive containing the PDF file I am using (cobjet.pdf):
http://perso.wanadoo.fr/aymeric.sabine/developpement/bibliotheque/c/libal.zip
Thanks.
--
Gerard