
Re: problem mouse copy/past from PDF



Well, I did write in
https://lists.debian.org/debian-user/2016/09/msg00653.html
that "This is one area where a bit of experimentation will help much
more than trying to understand the scattered documentation."

On Tue 20 Sep 2016 at 15:08:58 (+0100), Brian wrote:
> On Mon 19 Sep 2016 at 22:41:23 -0500, David Wright wrote:
> 
> > On Sun 18 Sep 2016 at 16:14:37 (-0400), Haines Brown wrote:
> > > I've begun to experience problems using the mouse to select a passage in
> > > a PDF displayed with xpdf 3.03-10 in order to paste it elsewhere.
> > > 
> > > The ends of lines are truncated to varying degrees. For example in a
> > > PDF with this:
> > > 
> > >   123456789
> > >   123456789
> > >   1234567
> > > 
> > > The paste might look like
> > > 
> > >   12345678
> > >   1234567
> > >   123456
> > 
> > Can you confirm that dragging your mouse produces a black rectangle,
> > and that the rectangle has the last digits (the ones that get lost)
> > highlighted?
> 
> Could be a possible cause. My mouse skills aren't brilliant and not
> precisely positioning the rectangle has often led to my having to redo
> the copying.
> 
> What could also be tried is a search for '123456789'. Searching is just
> another form of text extraction. If a string cannot be found, it cannot
> be copied correctly after highlighting it.

That's a good idea, and it seems to correlate with pdftotext's
behaviour but is much quicker.
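
A rough sketch of that comparison in Python, for anyone who wants to
script it. It assumes pdftotext is on the PATH; the function names
(extract_text, contains) are made up for illustration, and whitespace
is squashed because extraction rarely preserves the page layout.

```python
import subprocess


def extract_text(pdf_path: str) -> str:
    """Run pdftotext on pdf_path; the "-" argument sends the text to stdout."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,  # raise if pdftotext itself fails
    )
    return result.stdout


def squash(s: str) -> str:
    """Collapse every run of whitespace (including newlines) to one space."""
    return " ".join(s.split())


def contains(extracted: str, needle: str) -> bool:
    """True if needle occurs in the extracted text, ignoring layout."""
    return squash(needle) in squash(extracted)
```

Something like contains(extract_text("file.pdf"), "123456789") then
gives a quick yes/no on whether the string is likely to survive a
copy and paste.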

> > My own experience is all or nothing. What I get correlates with the
> > output of pdftotext; if that can extract the text, I can copy it
> > with the mouse, if not then I can't. PDFs I produce with paps, for
> > example, don't work: I don't know why this is the case.
> 
> How do you produce a PDF using paps?

Sorry, missed out a step. The paps output is filtered through ps2pdf
so that could explain a lot. Thanks for reminding me. (The clue is in
the name!)

> > Actually, there is a third case: the pasted text is garbage. I think
> > this happens if the fonts are stripped of unused glyphs and then
> > packed into the minimum number of fonts to save memory. I may be
> > wrong here, though.
> 
> One table in a PDF stores character shapes (glyphs). This table is used
> by mupdf (say) to draw the page. mupdf does this without knowing that it
> is text; it is interested only in the shapes.
> 
> A second table (the ToUnicode map) is used to work out what the text
> says. The first table says that the first shape in the word "Debian" looks
> like a "D". The second table says that that shape has a particular
> unicode value.
> 
> A defective or missing ToUnicode map leaves mupdf with no idea what the
> shapes mean, although it will render them correctly on the screen
> or in print. So it resorts to a default mapping. The result is garbage
> for copy/paste. However, it can be logical garbage; every "D" becomes
> "X", every "b" a "P" etc. When searching, the string being looked for
> will not be found. ("Debian" is "XGP?yL", for example).
> 
>   https://github.com/angea/PDF101/tree/master/handcoded/textextract
> 
> is of interest.

Useful reference, thanks.
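
The two-table idea can be mimicked with a toy model in Python. This is
only a sketch: the character codes are invented so that they reproduce
the "XGP?yL" example, and the fallback (treating each code as if it
were a plain character code) is a simplifying assumption, not what any
particular viewer actually does.

```python
# Character codes as they might appear in the page content for the word
# "Debian" in a subsetted font. The values are invented for illustration.
CODES = [0x58, 0x47, 0x50, 0x3F, 0x79, 0x4C]

# Toy stand-in for the ToUnicode map: code -> the character it means.
TO_UNICODE = {
    0x58: "D", 0x47: "e", 0x50: "b",
    0x3F: "i", 0x79: "a", 0x4C: "n",
}


def extract(codes, to_unicode=None):
    """Turn character codes into text.

    With a ToUnicode map, each code is looked up in it. Without one, the
    assumed fallback reads each code as a character code directly, which
    gives consistent but meaningless output.
    """
    if to_unicode is None:
        return "".join(chr(c) for c in codes)
    return "".join(to_unicode.get(c, chr(c)) for c in codes)


with_map = extract(CODES, TO_UNICODE)  # the intended text
without_map = extract(CODES)           # the "logical garbage"
```

The shapes drawn on screen are the same in both cases; only the text
handed to search and copy/paste differs, which is why a search for
"Debian" fails on the second result.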

> > > Evince apparently does not support selecting text for copying. This does
> > > not happen on other machines.
> > 
> > My experience here is similar to xpdf but with a few differences: when
> > it works (the same files do), the selection is line by line (ie like
> > an xterm) rather than a strict rectangle; if it can't do it, it
> > doesn't highlight (whereas xpdf "lies": it highlights but fails to
> > copy); the highlighting may be coloured (white→blue, black→white) or
> > black (which hides the text).
> 
> Evince seems to be aware if *all* the text is not copiable and will then
> not allow it to be selected. It does not appear to be aware when only
> portions of a document are not copiable/searchable and these portions
> are selectable.

Well,   man xpdf   says baldly "Dragging the mouse with the left
button held down will highlight an arbitrary rectangle." I guess I
hadn't realised just how bald that rectangle can be.
It's tedious ascertaining anything about xpdf in the "jessie period"
because so much of it is broken; I have to repeat everything in
wheezy to make sure the problem is ephemeral. (Will these problems
go away?)

Cheers,
David.

