
Re: pdftohtml



On Sun, Dec 19, 2004 at 01:39:49PM -0800, Karsten M. Self wrote:
> on Wed, Dec 01, 2004 at 01:48:33AM +0100, Gerard Robin (jag.robin18@wanadoo.fr) wrote:
> > Hello,
> > 
> > I have a few problems with pdftohtml (unstable) :
> > 
> > with one pdf file I get a suitable html file but with another one I get an unreadable html file.
> > 
> > I tried "pdftohtml -c -l 1 file.pdf"  but the output is always unreadable and I get the message:  
> > 
> > free(): invalid pointer 0x80f02e0!
> > Page-1
> > 
> > 
> > However, xpdf (or gv) displays this file.pdf correctly.
> > 
> > I guess that the problem comes from some feature of this PDF file, and
> > I would like to know if it 
> 
> Note first that 'PDF' isn't a simple file format.  Some PDFs are little
> more than marked-up text, others are essentially large image files
> (scanned in faxes from lawyers, such as are posted to Groklaw, are
> infamous for this).
> 
> There are also a few different versions of the PDF and PS formats.
> 
> 
> If you can post or point to the file you're trying to convert, this
> could be helpful.  Knowing how that file was created and with what
> tools, ditto.
> 
> 'ps2ps' on a Postscript file sometimes works around bugs that stymie
> some viewers (or printers).  It's a roundabout way, but:
> 
>    pdf2ps file.pdf file.ps
>    ps2ps file.ps file-new.ps
>    ps2pdf file-new.ps file-new.pdf
>    pdftohtml file-new.pdf file-new.html
> 
> ...might get you somewhere.  Most likely, a really broken hash of a
> file.
> 
> 
> Alternatively, if the source of the PDF file is available, converting
> *it* to HTML directly should provide far superior results.
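
For reference, the round trip quoted above can be wrapped in a small shell function that stops at the first failing step. This is only a sketch: roundtrip is a name I made up, and it assumes pdf2ps, ps2ps, ps2pdf (from Ghostscript) and pdftohtml are all on the PATH.

```shell
#!/bin/sh
# Sketch only: "roundtrip" is a hypothetical helper name.
# Assumes pdf2ps, ps2ps, ps2pdf and pdftohtml are installed.
roundtrip() {
    in=$1
    base=${in%.pdf}    # strip the .pdf suffix: file.pdf -> file

    # Each step runs only if the previous one succeeded.
    pdf2ps "$in" "$base.ps" &&
    ps2ps "$base.ps" "$base-new.ps" &&
    ps2pdf "$base-new.ps" "$base-new.pdf" &&
    pdftohtml "$base-new.pdf" "$base-new.html"
}
```

It would be called as: roundtrip file.pdf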

I have joined the pdftohtml-general list and I obtained part of the solution:

We have to copy the file /etc/xpdf/xpdfrc to our home directory (as .xpdfrc) and
add this line to it:

unicodeMap Latin2 /usr/share/xpdf/latin2/Latin2.unicodeMap

After that, we run the command:

pdftohtml -enc Latin2 file.pdf

Normally this should produce file.html, but I got a segmentation fault ;-)

I tried again with pdftohtml -c -enc Latin2 file.pdf, and that worked.
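
A minimal sketch of the configuration step above, in case it helps someone else. The paths are the ones from this message; add_latin2_map is a helper name I made up, and the sketch assumes the copied file ends with a newline.

```shell
#!/bin/sh
# Sketch only: "add_latin2_map" is a hypothetical helper name.
# Defaults are the paths mentioned in this message.
add_latin2_map() {
    src=${1:-/etc/xpdf/xpdfrc}
    dest=${2:-$HOME/.xpdfrc}
    line='unicodeMap Latin2 /usr/share/xpdf/latin2/Latin2.unicodeMap'

    # Copy the system-wide file only if there is no personal one yet.
    [ -f "$dest" ] || cp "$src" "$dest" || return 1

    # Append the map line once (-x: whole line, -F: fixed string).
    grep -qxF "$line" "$dest" || printf '%s\n' "$line" >> "$dest"
}
```

After that: add_latin2_map && pdftohtml -c -enc Latin2 file.pdf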

The result was better than with the plain command pdftohtml file.pdf, but it was
still not perfect:

The accents are almost all right, except for è and ê, and the underline
(image.png) is not in the right place.

The pdftohtml-general list member who helped me was surprised that the command

pdftohtml -enc Latin2 file.pdf

gave me a segmentation fault, whereas it worked fine for him.

He wondered whether the problem lies with my OS (unstable).

Here is the link to the archive containing the PDF file (cobjet.pdf) that I am using:

http://perso.wanadoo.fr/aymeric.sabine/developpement/bibliotheque/c/libal.zip


Thanks.
-- 
Gerard 


