
Re: Bug#337084: tetex-base: latex/dvips/ps2pdf produces buggy pdf files



On Wed, Nov 02, 2005 at 19:10 +0100, Hilmar Preusse wrote:
> On 02.11.05 Frank Küster (frank@kuesterei.ch) wrote:
> > Hilmar Preusse <hille42@web.de> wrote:
> > > On 02.11.05 Frank Küster (frank@debian.org) wrote:
> 
> > >> it is not possible to search for anything (e.g. the letter a) in
> > >> Acrobat Reader.  Don't yet know why...
> > >> 
> > > Because PS Type3 fonts are used once you activate T1 with
> > > \usepackage[T1]{fontenc}. Either:
> > > - install cm-super
> > > - call \usepackage{lmodern}
> > > - call \usepackage{ae}
> > 
> > You mean, Acrobat Reader generally cannot search in Type 3 fonts?
> > Strange.
> > 
> Not sure anymore. When using pdflatex on the file above I can search
> the file (and use pdftotext); after running your commands it does not
> work. I suggest contacting dctt about that before filing a bug. I'm
> afraid this is FAD.
> When loading ae, the latex/dvips/ps2pdf route works too.

Text extraction from PDF is really complicated. If one adds a few
interesting things (fi, ä, ß) to Frank's test file, pdftotext (most
easily invoked via 'less <pdf-file>') shows that 'fi' is not found at
all, 'ä' is found, and 'ß' is found as 'ÿ', even when the file is
processed with pdflatex. IIRC there is some stage in the text extraction
where a default encoding (Latin-1 or something similar) is assumed.
pdflatex probably includes the Type3 font with an encoding equivalent to
T1. Now the code position of 'fi' in T1 (0x1C) is not defined in
Latin-1, the code position of 'ß' in T1 (0xFF) is 'ÿ' in Latin-1, and
the code position of 'ä' (0xE4) is the same in both. So this fits. I
guess that ghostscript changes the encoding of the Type3 font when
creating the PDF, which makes text extraction rather meaningless. If one
uses Type1 fonts, ghostscript is probably able to use a sensible
encoding based on the glyph names in the font.
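
To make the encoding argument concrete, here is a small Python sketch. It
only illustrates the hypothesis above (T1 code positions being read as
Latin-1); it is not what pdftotext or ghostscript actually do internally.
The names T1_POSITIONS, read_as_latin1 and simulate_extraction are made
up for this example.

```python
# T1 (Cork) code positions of the three test glyphs.
T1_POSITIONS = {
    "fi": 0x1C,  # fi ligature
    "ä": 0xE4,
    "ß": 0xFF,
}


def read_as_latin1(code):
    """Interpret a one-byte code position as Latin-1.

    Returns the character, or None if the position holds no printable
    character (assumption: an extractor would drop such a position).
    """
    ch = bytes([code]).decode("latin-1")
    return ch if ch.isprintable() else None


def simulate_extraction():
    """Map each glyph to what a Latin-1 reading of its T1 slot yields."""
    return {glyph: read_as_latin1(code) for glyph, code in T1_POSITIONS.items()}


if __name__ == "__main__":
    for glyph, seen in simulate_extraction().items():
        print(f"{glyph!r} -> {seen!r}")
```

Under these assumptions 'fi' maps to nothing, 'ä' to itself and 'ß' to
'ÿ', which matches what pdftotext showed.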

cheerio
ralf



