
Re: Bug#337084: tetex-base: latex/dvips/ps2pdf produces buggy pdf files



Ralf Stubner <ralf.stubner@web.de> wrote:

> Text extraction from PDF is really complicated. If one adds a few
> interesting things (fi, ä, ß) to Frank's test file, one finds with
> pdftotext (best used via 'less <pdf-file>') that 'fi' is not found at
> all, 'ä' is found, and 'ß' is found as 'ÿ', even when processed with
> pdflatex. IIRC there is some stage in the text-extraction where some
> default encoding (Latin-1 or something similar) is used. pdflatex
> probably includes the Type3 font with an encoding equivalent to T1. Now
> the code position of 'fi' in T1 is not defined in Latin-1, the code
> position of 'ß' in T1 is 'ÿ' in Latin-1, the code position of 'ä' is the
> same in both. So this fits. I guess that ghostscript changes the
> encoding of the Type3 font when creating the PDF, which makes text
> extraction rather meaningless. If one uses Type1 fonts, ghostscript is
> probably able to use a sensible encoding based on the glyphnames in the
> font. 
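
To make the code-position argument above concrete, here is a minimal
sketch in Python. The three T1 (Cork) code positions are filled in from
the T1 encoding table and are not taken from this thread; the names
T1_CODES and misread_as_latin1 are made up for illustration. It only
models an extractor that treats T1 byte values as Latin-1, which is the
guess made above, not a confirmed description of what pdftotext or
ghostscript actually do.

```python
# Assumed T1 (Cork) code positions for the three test glyphs.
T1_CODES = {
    "fi": 0x1C,  # fi ligature
    "ä": 0xE4,   # adieresis, same slot as in Latin-1
    "ß": 0xFF,   # germandbls
}


def misread_as_latin1(glyph: str) -> str:
    """Return the character a Latin-1 decoder sees at the glyph's T1 slot."""
    return bytes([T1_CODES[glyph]]).decode("latin-1")


if __name__ == "__main__":
    for glyph in T1_CODES:
        seen = misread_as_latin1(glyph)
        # Control characters are shown as "(not printable)" instead of raw.
        shown = seen if seen.isprintable() else "(not printable)"
        print(f"{glyph!r} at 0x{T1_CODES[glyph]:02X} -> {shown}")
```

Under these assumptions 'ä' survives, 'ß' comes out as 'ÿ', and 'fi'
lands on a control character, so a search for "fi" cannot match.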

That all sounds very sensible, *but*: on dctt, where this first came up
(thread started by "Nils"), several people said that they could use the
find function on PDF files. I assume they read the question properly
and used latex/dvips/ps2pdf.

Regards, Frank
-- 
Frank Küster
Inst. f. Biochemie der Univ. Zürich
Debian Developer


