
Re: problem mouse copy/past from PDF



On Mon 19 Sep 2016 at 22:41:23 -0500, David Wright wrote:

> On Sun 18 Sep 2016 at 16:14:37 (-0400), Haines Brown wrote:
> > I've begun to experience problems using the mouse to select a passage in
> > a PDF displayed with xpdf 3.03-10 in order to paste it elsewhere.
> > 
> > The ends of lines are truncated to varying degrees. For example in a
> > PDF with this:
> > 
> >   123456789
> >   123456789
> >   1234567
> > 
> > The paste might look like
> > 
> >   12345678
> >   1234567
> >   123456
> 
> Can you confirm that dragging your mouse produces a black rectangle,
> and that the rectangle has the last digits (the ones that get lost)
> highlighted thus.

That could be a cause. My mouse skills aren't brilliant, and not
precisely positioning the rectangle has often led to my having to redo
the copying.

What could also be tried is a search for '123456789'. Searching is just
another form of text extraction. If the string cannot be found, it
cannot be copied correctly after highlighting it either.
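
The same check can be made outside the viewer. What follows is only a
rough sketch, assuming pdftotext (from poppler-utils or xpdf) is on the
PATH; the helper names and "sample.pdf" are made up for illustration.

```python
import subprocess


def normalise(text):
    """Collapse runs of whitespace so line breaks do not defeat the search."""
    return " ".join(text.split())


def contains(extracted, needle):
    """True if needle occurs in the extracted text, ignoring whitespace layout."""
    return normalise(needle) in normalise(extracted)


def pdf_text(path):
    """Return what pdftotext extracts from the PDF at path.

    Assumes a pdftotext binary is installed; "-" sends the text to stdout.
    """
    result = subprocess.run(
        ["pdftotext", path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Something like contains(pdf_text("sample.pdf"), "123456789") should then
say whether the string is extractable at all. If it is not, highlighting
it in a viewer is unlikely to copy it correctly.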

> My own experience is all or nothing. What I get correlates with the
> output of pdftotext; if that can extract the text, I can copy it
> with the mouse, if not then I can't. PDFs I produce with paps, for
> example, don't work: I don't know why this is the case.

How do you produce a PDF using paps?

> Actually, there is a third case: the pasted text is garbage. I think
> this happens if the fonts are stripped of unused glyphs and then
> packed into the minimum number of fonts to save memory. I may be
> wrong here, though.

One table in a PDF stores character shapes (glyphs). This table is used
by mupdf (say) to draw the page. mupdf does this without knowing that it
is text; it is interested only in the shapes.

A second table (the ToUnicode map) is used to work out what the text
says. The first table says that the first shape in the word "Debian"
looks like a "D". The second table says that that shape has a particular
Unicode value.

With a defective or missing ToUnicode map, mupdf has no idea what the
shapes mean, although it will render them correctly on the screen or in
print. So it resorts to a default mapping. The result is garbage for
copy/paste. However, it can be logical garbage; every "D" becomes "X",
every "b" a "P" etc. When searching, the string being looked for will
not be found ("Debian" is "XGP?yL", for example).

  https://github.com/angea/PDF101/tree/master/handcoded/textextract

is of interest.
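
The two-table idea can be mimicked with a toy model. This is not how
mupdf or any other viewer is implemented; the glyph numbering and the
fallback (code + 0x40) are assumptions made up for the illustration, so
the garbage it produces differs from the "XGP?yL" example above.

```python
# Toy model only: not how any real PDF viewer works.

def subset_font(text):
    """Give each distinct character a small glyph code, in order of first use."""
    code_for = {}
    for ch in text:
        if ch not in code_for:
            code_for[ch] = len(code_for) + 1
    return code_for


def extract(codes, to_unicode=None):
    """Turn glyph codes back into text.

    With a ToUnicode-style map the original characters come back.
    Without one, fall back to an arbitrary default (code + 0x40);
    that default is an assumption chosen for this sketch.
    """
    if to_unicode is not None:
        return "".join(to_unicode[c] for c in codes)
    return "".join(chr(0x40 + c) for c in codes)


page_text = "Debian"
code_for = subset_font(page_text)             # what the page stores: codes
codes = [code_for[ch] for ch in page_text]
to_unicode = {code: ch for ch, code in code_for.items()}

with_map = extract(codes, to_unicode)         # the text as written
without_map = extract(codes)                  # consistent garbage
found = page_text in without_map              # a search for "Debian" fails

print(with_map, without_map, found)
```

The drawing side never needs to_unicode, which is why the page can look
perfect while copy/paste and search both fail.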
 
> > Evince apparently does not support selecting text for copying. This does
> > not happen on other machines.
> 
> My experience here is similar to xpdf but with a few differences: when
> it works (the same files do), the selection is line by line (ie like
> an xterm) rather than a strict rectangle; if it can't do it, it
> doesn't highlight (whereas xpdf "lies": it highlights but fails to
> copy); the highlighting may be coloured (white→blue, black→white) or
> black (which hides the text).

Evince seems to be aware when *none* of the text is copiable and will
then not allow it to be selected. It does not appear to notice when only
portions of a document are not copiable/searchable, and those portions
remain selectable.

-- 
Brian.

